Dataset research report
FineWeb research report
A reproducible data report with schema notes, generated chart evidence, suggested follow-up questions, and export-ready Helix queries.
Executive Summary
🍷 FineWeb: 15 trillion tokens of the finest data the 🌐 web has to offer.
What is it? The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.
Research Context
FineWeb: 500 rows by 9 columns. These exploratory charts are generated automatically from the data; open the dataset in Helix to ask your own questions.
Data Profile
Chart Evidence
These views are generated from the dataset profile. Each chart is paired with a Helix query so it can be opened, adjusted, and exported.
language_score over time
Trend of language_score and token_count across date.
language_score vs token_count
Relationship between language_score and token_count.
Distribution of language_score
Histogram of language_score values.
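As a rough illustration of what the histogram view computes, the sketch below bins the six language_score values visible in the preview rows of this report. The bin range (0.7 to 1.0) and the bin count (6) are assumptions chosen for this small sample; they are not the settings Helix uses, and six rows say little about the full 500-row profile.

```python
# language_score values copied from the six preview rows in this report.
scores = [0.9721, 0.948, 0.9115, 0.9565, 0.7982, 0.7543]


def histogram(values, lo=0.7, hi=1.0, bins=6):
    """Count values into equal-width bins over [lo, hi].

    A value equal to `hi` is placed in the last bin; values outside
    the range are skipped. The range and bin count are illustrative.
    """
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        if not lo <= v <= hi:
            continue  # skip out-of-range values instead of failing
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


counts = histogram(scores)
for i, n in enumerate(counts):
    left = 0.7 + i * 0.05
    print(f"{left:.2f}-{left + 0.05:.2f}: {'#' * n}")
```

Equal-width binning keeps the example easy to check by hand; a real profile over all rows would likely want more bins.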
Follow-Up Queries
Preview Rows
| # | text (text) | id (text) | dump (text) | url (text) | date (text) | file_path (text) | language (text) | language_score (float) |
|---|---|---|---|---|---|---|---|---|
| 1 | How AP reported in all formats from tornado-stricken regionsMarch 8, 2012 When the first serious bout of tornadoes of 2012 blew through mid… | <urn:uuid:d66bc6fe-8477-4adf-b430-f6a558ccc8ff> | CC-MAIN-2013-20 | http://%[email protected]/Content/Press-Release/2012/How-AP-reported-in-all-formats-from-tornado-stricken-regions | 2013-05-18T05:48:54Z | s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz | en | 0.9721 |
| 2 | Did you know you have two little yellow, nine-volt-battery-sized adrenal glands in your body, just chilling out, maxin’, relaxin’ all cool … | <urn:uuid:803e14c3-dc2e-43d6-b75d-6fb3981c4fe6> | CC-MAIN-2013-20 | http://1000awesomethings.com/2012/09/24/934-adrenaline/ | 2013-05-18T08:11:45Z | s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz | en | 0.948 |
| 3 | Car Wash For Clara! Now is your chance to help! 2 year old Clara Woodward has Cancer! Clara can’t say “Neuroblastoma” but she knows how it … | <urn:uuid:ac1bbfff-9519-4967-9c64-3dc3a4b471ec> | CC-MAIN-2013-20 | http://1027kord.com/car-wash-for-clara/ | 2013-05-18T06:49:55Z | s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz | en | 0.9115 |
| 4 | Listeners Get Sky-high View of Missoula From Hot Air Balloons On Friday, June 1, during the Graduation Matters carnival, Townsquare Media –… | <urn:uuid:c1445c58-b111-4c4e-badd-1e43ec317df7> | CC-MAIN-2013-20 | http://1075zoofm.com/listeners-get-sky-high-view-of-missoula-from-hot-air-balloons/ | 2013-05-18T06:25:20Z | s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz | en | 0.9565 |
| 5 | Log In Please enter your ECode to log in. Forgotten your eCode? If you created your login but do not remember your eCode please enter your … | <urn:uuid:e5829f7d-b944-4468-9573-61b7cb3078cc> | CC-MAIN-2013-20 | http://1105govinfoevents.com/enterprisearchitectureevent/public/MyBriefcasef671.html?ID=563&sortMenu=103001&exp=1%2F26%2F2009+3%3A27%3A46+PM | 2013-05-18T05:27:01Z | s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz | en | 0.7982 |
| 6 | spotlight provides a convenient rechargeable LED light for work play and everyday life. choose from many vibrant colors to match your car, … | <urn:uuid:6bfca20f-ea67-41ba-b995-b7081b4a8b15> | CC-MAIN-2013-20 | http://12vspotlight.com/ | 2013-05-18T06:49:17Z | s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz | en | 0.7543 |
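The preview rows suggest that the dump identifier also appears inside file_path and that date is an ISO 8601 UTC timestamp. The sketch below checks both on one row, using values copied from preview row 2. The regular expression and the helper names (`dump_from_path`, `parse_crawl_date`) are assumptions inferred from these six rows only, not a documented FineWeb or CommonCrawl contract.

```python
import re
from datetime import datetime, timezone

# Values copied from preview row 2 of this report.
row = {
    "dump": "CC-MAIN-2013-20",
    "date": "2013-05-18T08:11:45Z",
    "file_path": (
        "s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/"
        "warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz"
    ),
}

# Assumed layout: .../crawl-data/<dump>/segments/... (inferred from the preview).
DUMP_RE = re.compile(r"/crawl-data/(CC-MAIN-\d{4}-\d{2})/")


def dump_from_path(file_path):
    """Return the dump id embedded in a file_path, or None if absent."""
    match = DUMP_RE.search(file_path)
    return match.group(1) if match else None


def parse_crawl_date(value):
    """Parse a timestamp such as 2013-05-18T08:11:45Z as timezone-aware UTC."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


path_dump = dump_from_path(row["file_path"])
crawl_date = parse_crawl_date(row["date"])
consistent = path_dump == row["dump"]
print(path_dump, crawl_date.isoformat(), consistent)
```

A check like this is a cheap way to confirm that dump and file_path agree before grouping rows by crawl.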
Data Dictionary
- text: text
- id: text
- dump: categorical
- url: text
- date: datetime
- file_path: categorical
- language: categorical
- language_score: numeric
- token_count: numeric
Method And Limits
- Load the catalog entry and preview rows from the processed dataset file.
- Infer numeric, categorical, time, and location fields from real columns.
- Generate a small set of defensive Plotly chart specifications from that profile.
- Expose each chart idea as a query link so the report can be rerun or exported in Helix.
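The inference and chart-generation steps above can be sketched roughly as follows. This is not the report generator's actual code: the `infer_kind` heuristic, its 0.5 uniqueness threshold, and the `histogram_spec` helper are hypothetical, and the chart is a plain dictionary in the general shape of a Plotly figure (`data` plus `layout`) so that the example needs only the standard library. The sample rows reuse values from preview rows 2 to 4.

```python
import json
from datetime import datetime

# A few columns from preview rows 2-4 of this report.
rows = [
    {"url": "http://1000awesomethings.com/2012/09/24/934-adrenaline/",
     "dump": "CC-MAIN-2013-20", "date": "2013-05-18T08:11:45Z", "language_score": 0.948},
    {"url": "http://1027kord.com/car-wash-for-clara/",
     "dump": "CC-MAIN-2013-20", "date": "2013-05-18T06:49:55Z", "language_score": 0.9115},
    {"url": "http://1075zoofm.com/listeners-get-sky-high-view-of-missoula-from-hot-air-balloons/",
     "dump": "CC-MAIN-2013-20", "date": "2013-05-18T06:25:20Z", "language_score": 0.9565},
]


def _is_iso_datetime(value):
    try:
        datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
        return True
    except ValueError:
        return False


def infer_kind(values):
    """Hypothetical heuristic: numeric, datetime, categorical, or text."""
    present = [v for v in values if v is not None]
    if not present:
        return "text"  # nothing to infer from; fall back to the loosest kind
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present):
        return "numeric"
    if all(isinstance(v, str) and _is_iso_datetime(v) for v in present):
        return "datetime"
    # Few distinct values relative to row count suggests a category.
    if len(set(present)) / len(present) <= 0.5:
        return "categorical"
    return "text"


def histogram_spec(column, values):
    """Build a Plotly-style figure dictionary for one numeric column."""
    return {
        "data": [{"type": "histogram", "x": list(values)}],
        "layout": {"title": {"text": f"Distribution of {column}"}},
    }


profile = {col: infer_kind([r[col] for r in rows]) for col in rows[0]}
# Defensive step: only numeric columns get a histogram.
specs = [
    histogram_spec(col, [r[col] for r in rows])
    for col, kind in profile.items()
    if kind == "numeric"
]
print(json.dumps(profile, indent=2))
```

With only three rows the uniqueness ratio is a blunt instrument; on the full 500 rows a column such as file_path could plausibly land in either the categorical or the text bucket depending on the threshold.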
This report is intentionally reproducible. It relies only on the local catalog metadata and the generated chart specifications, and it draws no conclusions beyond what the dataset itself shows.