Fineweb

Name: Fineweb
Creator: Helix
Keywords: dataset, discovered, hugging face, language:en, license:odc-by, modality:tabular, size_categories:10B<n<100B, task_categories:text-generation

Source: Hugging Face

🍷 FineWeb: 15 trillion tokens of the finest data the 🌐 web has to offer

What is it?

The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release…

See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.

huggingfacefw--fineweb.parquet · 500 rows · HuggingFaceFW/fineweb


Example questions to ask about this data

line chart of language_score over date
average language_score by dump
top 10 dump by total language_score
scatter language_score vs token_count coloured by dump

Data preview

500 rows · 9 columns · showing first 12

#	text (text)	id (text)	dump (text)	url (text)	date (text)	file_path (text)	language (text)	language_score (float)	token_count (integer)
1	How AP reported in all formats from tornado-stricken regionsMarch 8, 2012 When the first serious bout of tornadoes of 2012 blew through mid…	<urn:uuid:d66bc6fe-8477-4adf-b430-f6a558ccc8ff>	CC-MAIN-2013-20	http://%[email protected]/Content/Press-Release/2012/How-AP-reported-in-all-formats-from-tornado-stricken-regions	2013-05-18T05:48:54Z	s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz	en	0.9721	717
2	Did you know you have two little yellow, nine-volt-battery-sized adrenal glands in your body, just chilling out, maxin’, relaxin’ all cool …	<urn:uuid:803e14c3-dc2e-43d6-b75d-6fb3981c4fe6>	CC-MAIN-2013-20	http://1000awesomethings.com/2012/09/24/934-adrenaline/	2013-05-18T08:11:45Z	s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz	en	0.948	821
3	Car Wash For Clara! Now is your chance to help! 2 year old Clara Woodward has Cancer! Clara can’t say “Neuroblastoma” but she knows how it …	<urn:uuid:ac1bbfff-9519-4967-9c64-3dc3a4b471ec>	CC-MAIN-2013-20	http://1027kord.com/car-wash-for-clara/	2013-05-18T06:49:55Z	s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz	en	0.9115	125
4	Listeners Get Sky-high View of Missoula From Hot Air Balloons On Friday, June 1, during the Graduation Matters carnival, Townsquare Media –…	<urn:uuid:c1445c58-b111-4c4e-badd-1e43ec317df7>	CC-MAIN-2013-20	http://1075zoofm.com/listeners-get-sky-high-view-of-missoula-from-hot-air-balloons/	2013-05-18T06:25:20Z	s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz	en	0.9565	103
5	Log In Please enter your ECode to log in. Forgotten your eCode? If you created your login but do not remember your eCode please enter your …	<urn:uuid:e5829f7d-b944-4468-9573-61b7cb3078cc>	CC-MAIN-2013-20	http://1105govinfoevents.com/enterprisearchitectureevent/public/MyBriefcasef671.html?ID=563&sortMenu=103001&exp=1%2F26%2F2009+3%3A27%3A46+PM	2013-05-18T05:27:01Z	s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz	en	0.7982	75
6	spotlight provides a convenient rechargeable LED light for work play and everyday life. choose from many vibrant colors to match your car, …	<urn:uuid:6bfca20f-ea67-41ba-b995-b7081b4a8b15>	CC-MAIN-2013-20	http://12vspotlight.com/	2013-05-18T06:49:17Z	s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz	en	0.7543	102
7	K-State put themselves in sole position of first place in the Big 12 with their 79-70 over Iowa State on Saturday, and K-State is now #10 i…	<urn:uuid:dc9d9fd8-5a21-4ab0-bbb2-9718720e1cc2>	CC-MAIN-2013-20	http://1350kman.com/k-state-now-in-top-10/	2013-05-18T07:19:46Z	s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz	en	0.9459	204
8	Five Reasons I Love Boston 1. The water. The Atlantic Ocean, as deep and true as denim, so blue it melts into the sky, horizonless. And the…	<urn:uuid:64f968bf-14bc-48bd-a1bb-a43b3f4a3c3d>	CC-MAIN-2013-20	http://17andbaking.com/2012/09/30/five-reasons-i-love-boston/?like=1&source=post_flair&_wpnonce=fd9f0e7c6a	2013-05-18T07:25:34Z	s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz	en	0.9267	550
9	TRIBE CHIEF TRIS DAZZLES AT DISH, FLUBS IN FIELD IN SIXTH STRAIGHT TIGER WIN By Calvin J. Butterworth June 19, 1924 Ty Cobb can tell you. P…	<urn:uuid:2c08e1d4-9706-41d8-84dc-ee2939758c81>	CC-MAIN-2013-20	http://1924andyouarethere.blogspot.com/2009_07_01_archive.html	2013-05-18T05:54:12Z	s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz	en	0.9648	979
10	\|Tommy Pi - Trance Experience\| \|Written by Paul\| Tommy Pi started DJing at small private parties at the age of 13. He was always into music…	<urn:uuid:7e6216ca-0a01-498d-85f7-7d4aed299c98>	CC-MAIN-2013-20	http://1mix.co.uk/trance-shows/tommy-pi-trance-experience.html	2013-05-18T05:54:17Z	s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz	en	0.9857	527
11	When I found out we would be getting a PopATot for review I was excited! I knew before we even had it that we would like it, but I had no i…	<urn:uuid:0868921d-8323-4a3d-b012-51c15c046cc1>	CC-MAIN-2013-20	http://1plus1plus1equals1reviews.blogspot.com/2009/10/grand-finale-4-popatot.html?showComment=1256068362402	2013-05-18T06:19:38Z	s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz	en	0.954	826
12	2012 Indy Info It seems we can’t find what you’re looking for. Perhaps searching can help. giving it all away... where the illusions of sca…	<urn:uuid:b7319126-5fdb-4ae0-a17b-584c071b561c>	CC-MAIN-2013-20	http://2012indyinfo.com/category/sfhs/	2013-05-18T08:07:40Z	s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz	en	0.8378	123
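In the preview rows, the dump value (CC-MAIN-2013-20) also appears as a path component of file_path. The following is a minimal sketch of recovering it from the path, assuming that pattern holds beyond the 12 rows shown. `dump_from_file_path` is a hypothetical helper written for illustration, not part of any FineWeb or Helix tooling.

```python
import re
from typing import Optional

# Assumption: the crawl identifier directly follows "crawl-data/" in the path,
# as it does in the preview rows above.
_DUMP_PATTERN = re.compile(r"crawl-data/(CC-MAIN-\d{4}-\d{2})/")


def dump_from_file_path(file_path: str) -> Optional[str]:
    """Return the crawl identifier embedded in file_path, or None if absent."""
    match = _DUMP_PATTERN.search(file_path)
    return match.group(1) if match else None


# file_path copied from row 1 of the preview.
example_path = (
    "s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/"
    "warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz"
)
print(dump_from_file_path(example_path))
```

A check like this could be used to confirm that dump and file_path agree on every row before grouping by dump.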

Auto-generated charts

Fineweb: 500 rows by 9 columns. These exploratory charts are generated automatically from the data; open the dataset in Helix to ask your own questions.

Rows: 500
Columns: 9
Numeric cols: 2
Categorical cols: 3
Date range: 2013-05-18 → 2013-05-18

Charts

language_score over time

Trend of language_score and token_count across date.

language_score vs token_count

Relationship between language_score and token_count.

Distribution of language_score

Histogram of language_score values.
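As a rough illustration of what such a histogram summarises, the sketch below bins the language_score values from the 12 preview rows only, not the full 500-row file. The bin width of 0.05 is an assumption; the page does not state how the chart is binned.

```python
from collections import Counter

# language_score values from the 12 preview rows.
scores = [0.9721, 0.948, 0.9115, 0.9565, 0.7982, 0.7543,
          0.9459, 0.9267, 0.9648, 0.9857, 0.954, 0.8378]

# Assumed bin width of 0.05, expressed in units of 1/10000 so that
# binning uses integer arithmetic and avoids floating-point edge cases.
BIN_WIDTH = 500


def bin_index(score: float) -> int:
    """Index of the 0.05-wide bin that contains score."""
    return round(score * 10_000) // BIN_WIDTH


histogram = Counter(bin_index(s) for s in scores)

# Print a text histogram, one line per non-empty bin.
for index in sorted(histogram):
    low = index * BIN_WIDTH / 10_000
    high = (index + 1) * BIN_WIDTH / 10_000
    print(f"{low:.2f}-{high:.2f}: {'#' * histogram[index]}")
```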


Columns

text: text
id: text
dump: categorical
url: text
date: datetime
file_path: categorical
language: categorical
language_score: numeric
token_count: numeric
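The column list can be read as a record type. The sketch below is one possible mapping, not a schema published by the dataset authors: `FineWebRow` and `parse_date` are hypothetical names, and the choice to parse date as a UTC timestamp is an assumption based on the trailing `Z` in the preview values (the preview table stores it as text).

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class FineWebRow:
    """One row of the preview; field names follow the Columns list."""
    text: str
    id: str
    dump: str
    url: str
    date: datetime
    file_path: str
    language: str
    language_score: float
    token_count: int


def parse_date(value: str) -> datetime:
    """Parse timestamps shaped like 2013-05-18T06:49:55Z as UTC."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


# Values taken from row 3 of the preview; text is shortened here for brevity.
row = FineWebRow(
    text="Car Wash For Clara! Now is your chance to help!",
    id="<urn:uuid:ac1bbfff-9519-4967-9c64-3dc3a4b471ec>",
    dump="CC-MAIN-2013-20",
    url="http://1027kord.com/car-wash-for-clara/",
    date=parse_date("2013-05-18T06:49:55Z"),
    file_path=(
        "s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/"
        "warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz"
    ),
    language="en",
    language_score=0.9115,
    token_count=125,
)
print(row.dump, row.date.isoformat(), row.token_count)
```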