Dataset research report
Wiki Bio research report
A reproducible data report with schema notes, generated chart evidence, suggested follow-up questions, and export-ready Helix queries.
Executive Summary
This dataset gathers 728,321 biographies from wikipedia. It aims at evaluating text generation algorithms. For each article, we provide the first paragraph and the infobox (both tokenized). For each article, we extracted the first paragraph (text), the infobox (structured data). Each infobox is encoded as a list of (field name, field value) pairs. We used Stanford CoreNLP (http://stanfordnlp.github.io/CoreNLP/) to preprocess the data, i.e. we broke the text into sentences and tokenized both the text and the field values. The dataset was randomly split in three subsets train (80%), valid (10%), test (10%).
Follow-Up Queries
Method And Limits
- Load the catalog entry and preview rows from the processed dataset file.
- Infer numeric, categorical, time, and location fields from real columns.
- Generate a small set of defensive Plotly chart specifications from that profile.
- Expose each chart idea as a query link so the report can be rerun or exported in Helix.
This report is intentionally reproducible. It uses the local catalog metadata and generated chart specifications rather than claiming external conclusions beyond the dataset.