Fineweb
Hugging Face 🍷 FineWeb: 15 trillion tokens of the finest data the 🌐 web has to offer.

What is it? The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.
Data preview
500 rows · 9 columns · showing first 12

| # | text (text) | id (text) | dump (text) | url (text) | date (text) | file_path (text) | language (text) | language_score (float) | token_count (integer) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | How AP reported in all formats from tornado-stricken regionsMarch 8, 2012 When the first serious bout of tornadoes of 2012 blew through mid… | <urn:uuid:d66bc6fe-8477-4adf-b430-f6a558ccc8ff> | CC-MAIN-2013-20 | http://%[email protected]/Content/Press-Release/2012/How-AP-reported-in-all-formats-from-tornado-stricken-regions | 2013-05-18T05:48:54Z | s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz | en | 0.9721 | 717 |
| 2 | Did you know you have two little yellow, nine-volt-battery-sized adrenal glands in your body, just chilling out, maxin’, relaxin’ all cool … | <urn:uuid:803e14c3-dc2e-43d6-b75d-6fb3981c4fe6> | CC-MAIN-2013-20 | http://1000awesomethings.com/2012/09/24/934-adrenaline/ | 2013-05-18T08:11:45Z | s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz | en | 0.948 | 821 |
| 3 | Car Wash For Clara! Now is your chance to help! 2 year old Clara Woodward has Cancer! Clara can’t say “Neuroblastoma” but she knows how it … | <urn:uuid:ac1bbfff-9519-4967-9c64-3dc3a4b471ec> | CC-MAIN-2013-20 | http://1027kord.com/car-wash-for-clara/ | 2013-05-18T06:49:55Z | s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz | en | 0.9115 | 125 |
| 4 | Listeners Get Sky-high View of Missoula From Hot Air Balloons On Friday, June 1, during the Graduation Matters carnival, Townsquare Media … | <urn:uuid:c1445c58-b111-4c4e-badd-1e43ec317df7> | CC-MAIN-2013-20 | http://1075zoofm.com/listeners-get-sky-high-view-of-missoula-from-hot-air-balloons/ | 2013-05-18T06:25:20Z | s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz | en | 0.9565 | 103 |
| 5 | Log In Please enter your ECode to log in. Forgotten your eCode? If you created your login but do not remember your eCode please enter your … | <urn:uuid:e5829f7d-b944-4468-9573-61b7cb3078cc> | CC-MAIN-2013-20 | http://1105govinfoevents.com/enterprisearchitectureevent/public/MyBriefcasef671.html?ID=563&sortMenu=103001&exp=1%2F26%2F2009+3%3A27%3A46+PM | 2013-05-18T05:27:01Z | s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz | en | 0.7982 | 75 |
| 6 | spotlight provides a convenient rechargeable LED light for work play and everyday life. choose from many vibrant colors to match your car, … | <urn:uuid:6bfca20f-ea67-41ba-b995-b7081b4a8b15> | CC-MAIN-2013-20 | http://12vspotlight.com/ | 2013-05-18T06:49:17Z | s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz | en | 0.7543 | 102 |
| 7 | K-State put themselves in sole position of first place in the Big 12 with their 79-70 over Iowa State on Saturday, and K-State is now #10 i… | <urn:uuid:dc9d9fd8-5a21-4ab0-bbb2-9718720e1cc2> | CC-MAIN-2013-20 | http://1350kman.com/k-state-now-in-top-10/ | 2013-05-18T07:19:46Z | s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz | en | 0.9459 | 204 |
| 8 | Five Reasons I Love Boston 1. The water. The Atlantic Ocean, as deep and true as denim, so blue it melts into the sky, horizonless. And the… | <urn:uuid:64f968bf-14bc-48bd-a1bb-a43b3f4a3c3d> | CC-MAIN-2013-20 | http://17andbaking.com/2012/09/30/five-reasons-i-love-boston/?like=1&source=post_flair&_wpnonce=fd9f0e7c6a | 2013-05-18T07:25:34Z | s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz | en | 0.9267 | 550 |
| 9 | TRIBE CHIEF TRIS DAZZLES AT DISH, FLUBS IN FIELD IN SIXTH STRAIGHT TIGER WIN By Calvin J. Butterworth June 19, 1924 Ty Cobb can tell you. P… | <urn:uuid:2c08e1d4-9706-41d8-84dc-ee2939758c81> | CC-MAIN-2013-20 | http://1924andyouarethere.blogspot.com/2009_07_01_archive.html | 2013-05-18T05:54:12Z | s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz | en | 0.9648 | 979 |
| 10 | \|Tommy Pi - Trance Experience\| \|Written by Paul\| Tommy Pi started DJing at small private parties at the age of 13. He was always into music… | <urn:uuid:7e6216ca-0a01-498d-85f7-7d4aed299c98> | CC-MAIN-2013-20 | http://1mix.co.uk/trance-shows/tommy-pi-trance-experience.html | 2013-05-18T05:54:17Z | s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz | en | 0.9857 | 527 |
| 11 | When I found out we would be getting a PopATot for review I was excited! I knew before we even had it that we would like it, but I had no i… | <urn:uuid:0868921d-8323-4a3d-b012-51c15c046cc1> | CC-MAIN-2013-20 | http://1plus1plus1equals1reviews.blogspot.com/2009/10/grand-finale-4-popatot.html?showComment=1256068362402 | 2013-05-18T06:19:38Z | s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz | en | 0.954 | 826 |
| 12 | 2012 Indy Info It seems we can’t find what you’re looking for. Perhaps searching can help. giving it all away... where the illusions of sca… | <urn:uuid:b7319126-5fdb-4ae0-a17b-584c071b561c> | CC-MAIN-2013-20 | http://2012indyinfo.com/category/sfhs/ | 2013-05-18T08:07:40Z | s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz | en | 0.8378 | 123 |
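
In every preview row the `dump` label also appears as a path component of `file_path`, so the two columns can be cross-checked. The sketch below is a minimal illustration of that check, not part of the FineWeb tooling: the helper names `parse_dump` and `dump_from_file_path` are hypothetical, and reading the two numbers in the label as a year and a crawl index within that year is an assumption based on the `CC-MAIN-2013-20` pattern shown above.

```python
import re
from urllib.parse import urlparse

# Assumed label shape, taken from the preview: "CC-MAIN-<4 digits>-<2 digits>".
DUMP_PATTERN = re.compile(r"CC-MAIN-(\d{4})-(\d{2})")


def parse_dump(dump: str) -> tuple[int, int]:
    """Split a dump label into its two numeric parts (hypothetical helper)."""
    match = DUMP_PATTERN.fullmatch(dump)
    if match is None:
        raise ValueError(f"unexpected dump label: {dump!r}")
    return int(match.group(1)), int(match.group(2))


def dump_from_file_path(file_path: str) -> str:
    """Read the dump label out of a path shaped like
    s3://commoncrawl/crawl-data/<dump>/segments/... (layout seen in the preview)."""
    parts = urlparse(file_path).path.strip("/").split("/")
    if len(parts) < 2 or parts[0] != "crawl-data":
        raise ValueError(f"unexpected file_path layout: {file_path!r}")
    return parts[1]


# Values copied from preview row 1.
row = {
    "dump": "CC-MAIN-2013-20",
    "file_path": (
        "s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/"
        "warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz"
    ),
}

year, index = parse_dump(row["dump"])
consistent = dump_from_file_path(row["file_path"]) == row["dump"]
print(year, index, consistent)
```

A check like this is mainly useful as a sanity test after export or re-encoding, since a mismatch would suggest shifted or corrupted columns.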
Auto-generated charts
FineWeb preview: 500 rows by 9 columns. These exploratory charts are generated automatically from the data; open the dataset in Helix to ask your own questions.
Charts
- language_score over time: trend of language_score and token_count across date.
- language_score vs token_count: relationship between language_score and token_count.
- Distribution of language_score: histogram of language_score values.
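
The chart images themselves are not reproduced here, but two of them can be approximated from the 12 preview rows alone. The following is a rough sketch under stated assumptions: the bin range (0.75 to 1.0) and bin count (5) are arbitrary choices for illustration, the function names are hypothetical, and 12 rows are far too few to say anything about the full dataset.

```python
import math

# language_score and token_count values copied from the 12 preview rows.
SCORES = [0.9721, 0.948, 0.9115, 0.9565, 0.7982, 0.7543,
          0.9459, 0.9267, 0.9648, 0.9857, 0.954, 0.8378]
TOKENS = [717, 821, 125, 103, 75, 102, 204, 550, 979, 527, 826, 123]


def histogram(values, lo, hi, bins):
    """Count values into equal-width bins over [lo, hi]; hi falls in the last bin."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for value in values:
        if not lo <= value <= hi:
            raise ValueError(f"value {value} outside [{lo}, {hi}]")
        index = min(int((value - lo) / width), bins - 1)
        counts[index] += 1
    return counts


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


counts = histogram(SCORES, lo=0.75, hi=1.0, bins=5)
r = pearson(SCORES, TOKENS)
print("language_score histogram:", counts)
print("correlation(language_score, token_count):", round(r, 3))
```

The correlation is computed by hand rather than with a statistics library so the sketch runs on any recent Python 3 without extra dependencies.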
Columns
- text: text
- id: text
- dump: categorical
- url: text
- date: datetime
- file_path: categorical
- language: categorical
- language_score: numeric
- token_count: numeric
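
When the preview is exported as plain strings, the column types listed above have to be restored before any analysis. A minimal sketch of that coercion follows, assuming each row arrives as a dictionary of strings; the `PreviewRow` class and `coerce_row` function are hypothetical names for illustration, and the timestamp format is inferred from values such as `2013-05-18T06:49:17Z` in the preview.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class PreviewRow:
    """Typed view of one row, mirroring the nine columns listed above."""
    text: str
    id: str
    dump: str
    url: str
    date: datetime
    file_path: str
    language: str
    language_score: float
    token_count: int


def coerce_row(raw: dict[str, str]) -> PreviewRow:
    """Convert a row of strings into typed fields (hypothetical helper)."""
    return PreviewRow(
        text=raw["text"],
        id=raw["id"],
        dump=raw["dump"],
        url=raw["url"],
        # Timestamps in the preview end in "Z", which is taken to mean UTC.
        date=datetime.strptime(raw["date"], "%Y-%m-%dT%H:%M:%SZ").replace(
            tzinfo=timezone.utc
        ),
        file_path=raw["file_path"],
        language=raw["language"],
        language_score=float(raw["language_score"]),
        token_count=int(raw["token_count"]),
    )


# Values copied from preview row 6; the text is truncated as in the preview.
raw_row = {
    "text": "spotlight provides a convenient rechargeable LED light for work "
            "play and everyday life. choose from many vibrant colors to match "
            "your car, …",
    "id": "<urn:uuid:6bfca20f-ea67-41ba-b995-b7081b4a8b15>",
    "dump": "CC-MAIN-2013-20",
    "url": "http://12vspotlight.com/",
    "date": "2013-05-18T06:49:17Z",
    "file_path": "s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/"
                 "1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-"
                 "113-184.ec2.internal.warc.gz",
    "language": "en",
    "language_score": "0.7543",
    "token_count": "102",
}

row = coerce_row(raw_row)
print(row.date.isoformat(), row.language_score, row.token_count)
```

`strptime` is used instead of `datetime.fromisoformat` because older Python versions reject the trailing `Z`.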