Helix the Robot
Helix Helix
arrow_backAll datasets

Wiki Bio

Hugging Face

This dataset gathers 728,321 biographies from wikipedia. It aims at evaluating text generation algorithms. For each article, we provide the first paragraph and the infobox (both tokenized). For each article, we extracted the first paragraph (text), the infobox (structured data). Each infobox is encoded as a list of (field name, field value) pairs. We used Stanford CoreNLP (http://stanfordnlp.github.io/CoreNLP/) to preprocess the data, i.e. we broke the text into sentences and tokenized both the text and the field values. The dataset was randomly split in three subsets train (80%), valid (10%), test (10%).

cloud_downloadwiki_bio
boltOpen in Helix

Interesting queries to try

Login to Helix

Don't have an account? Sign up here

Sign Up for Helix

Already have an account? Login here