Fineweb

Hugging Face

🍷 FineWeb: 15 trillion tokens of the finest data the 🌐 web has to offer

What is it?

The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on 🏭 datatrove, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release…

See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.

huggingfacefw--fineweb.parquet · 500 rows · HuggingFaceFW/fineweb

Data preview

500 rows · 9 columns · showing first 12
Columns (name: type): text: text · id: text · dump: text · url: text · date: text · file_path: text · language: text · language_score: float · token_count: integer

The language_score and token_count digits are run together in this preview and are shown as they appear.

Row 1
  text: How AP reported in all formats from tornado-stricken regionsMarch 8, 2012 When the first serious bout of tornadoes of 2012 blew through mid…
  id: <urn:uuid:d66bc6fe-8477-4adf-b430-f6a558ccc8ff>
  dump: CC-MAIN-2013-20
  url: http://%[email protected]/Content/Press-Release/2012/How-AP-reported-in-all-formats-from-tornado-stricken-regions
  date: 2013-05-18T05:48:54Z
  file_path: s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz
  language: en
  language_score, token_count: 0.9721717

Row 2
  text: Did you know you have two little yellow, nine-volt-battery-sized adrenal glands in your body, just chilling out, maxin’, relaxin’ all cool …
  id: <urn:uuid:803e14c3-dc2e-43d6-b75d-6fb3981c4fe6>
  dump: CC-MAIN-2013-20
  url: http://1000awesomethings.com/2012/09/24/934-adrenaline/
  date: 2013-05-18T08:11:45Z
  file_path: s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz
  language: en
  language_score, token_count: 0.948821

Row 3
  text: Car Wash For Clara! Now is your chance to help! 2 year old Clara Woodward has Cancer! Clara can’t say “Neuroblastoma” but she knows how it …
  id: <urn:uuid:ac1bbfff-9519-4967-9c64-3dc3a4b471ec>
  dump: CC-MAIN-2013-20
  url: http://1027kord.com/car-wash-for-clara/
  date: 2013-05-18T06:49:55Z
  file_path: s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz
  language: en
  language_score, token_count: 0.9115125

Row 4
  text: Listeners Get Sky-high View of Missoula From Hot Air Balloons On Friday, June 1, during the Graduation Matters carnival, Townsquare Media –…
  id: <urn:uuid:c1445c58-b111-4c4e-badd-1e43ec317df7>
  dump: CC-MAIN-2013-20
  url: http://1075zoofm.com/listeners-get-sky-high-view-of-missoula-from-hot-air-balloons/
  date: 2013-05-18T06:25:20Z
  file_path: s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz
  language: en
  language_score, token_count: 0.9565103

Row 5
  text: Log In Please enter your ECode to log in. Forgotten your eCode? If you created your login but do not remember your eCode please enter your …
  id: <urn:uuid:e5829f7d-b944-4468-9573-61b7cb3078cc>
  dump: CC-MAIN-2013-20
  url: http://1105govinfoevents.com/enterprisearchitectureevent/public/MyBriefcasef671.html?ID=563&sortMenu=103001&exp=1%2F26%2F2009+3%3A27%3A46+PM
  date: 2013-05-18T05:27:01Z
  file_path: s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz
  language: en
  language_score, token_count: 0.798275

Row 6
  text: spotlight provides a convenient rechargeable LED light for work play and everyday life. choose from many vibrant colors to match your car, …
  id: <urn:uuid:6bfca20f-ea67-41ba-b995-b7081b4a8b15>
  dump: CC-MAIN-2013-20
  url: http://12vspotlight.com/
  date: 2013-05-18T06:49:17Z
  file_path: s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz
  language: en
  language_score, token_count: 0.7543102

Row 7
  text: K-State put themselves in sole position of first place in the Big 12 with their 79-70 over Iowa State on Saturday, and K-State is now #10 i…
  id: <urn:uuid:dc9d9fd8-5a21-4ab0-bbb2-9718720e1cc2>
  dump: CC-MAIN-2013-20
  url: http://1350kman.com/k-state-now-in-top-10/
  date: 2013-05-18T07:19:46Z
  file_path: s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz
  language: en
  language_score, token_count: 0.9459204

Row 8
  text: Five Reasons I Love Boston 1. The water. The Atlantic Ocean, as deep and true as denim, so blue it melts into the sky, horizonless. And the…
  id: <urn:uuid:64f968bf-14bc-48bd-a1bb-a43b3f4a3c3d>
  dump: CC-MAIN-2013-20
  url: http://17andbaking.com/2012/09/30/five-reasons-i-love-boston/?like=1&source=post_flair&_wpnonce=fd9f0e7c6a
  date: 2013-05-18T07:25:34Z
  file_path: s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz
  language: en
  language_score, token_count: 0.9267550

Row 9
  text: TRIBE CHIEF TRIS DAZZLES AT DISH, FLUBS IN FIELD IN SIXTH STRAIGHT TIGER WIN By Calvin J. Butterworth June 19, 1924 Ty Cobb can tell you. P…
  id: <urn:uuid:2c08e1d4-9706-41d8-84dc-ee2939758c81>
  dump: CC-MAIN-2013-20
  url: http://1924andyouarethere.blogspot.com/2009_07_01_archive.html
  date: 2013-05-18T05:54:12Z
  file_path: s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz
  language: en
  language_score, token_count: 0.9648979

Row 10
  text: |Tommy Pi - Trance Experience| |Written by Paul| Tommy Pi started DJing at small private parties at the age of 13. He was always into music…
  id: <urn:uuid:7e6216ca-0a01-498d-85f7-7d4aed299c98>
  dump: CC-MAIN-2013-20
  url: http://1mix.co.uk/trance-shows/tommy-pi-trance-experience.html
  date: 2013-05-18T05:54:17Z
  file_path: s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz
  language: en
  language_score, token_count: 0.9857527

Row 11
  text: When I found out we would be getting a PopATot for review I was excited! I knew before we even had it that we would like it, but I had no i…
  id: <urn:uuid:0868921d-8323-4a3d-b012-51c15c046cc1>
  dump: CC-MAIN-2013-20
  url: http://1plus1plus1equals1reviews.blogspot.com/2009/10/grand-finale-4-popatot.html?showComment=1256068362402
  date: 2013-05-18T06:19:38Z
  file_path: s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz
  language: en
  language_score, token_count: 0.954826

Row 12
  text: 2012 Indy Info It seems we can’t find what you’re looking for. Perhaps searching can help. giving it all away... where the illusions of sca…
  id: <urn:uuid:b7319126-5fdb-4ae0-a17b-584c071b561c>
  dump: CC-MAIN-2013-20
  url: http://2012indyinfo.com/category/sfhs/
  date: 2013-05-18T08:07:40Z
  file_path: s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz
  language: en
  language_score, token_count: 0.8378123
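The file_path values in the preview rows all appear to share one layout: an S3 bucket, a crawl-data directory named after the dump, a segment directory, and a WARC file name. The sketch below splits such a path into those parts. The pattern is inferred only from the twelve rows shown here, and the names `WarcLocation` and `parse_file_path` are hypothetical helpers, not part of FineWeb or any library.

```python
import re
from typing import NamedTuple


class WarcLocation(NamedTuple):
    """Parts of a file_path value, as they appear in the preview rows."""
    bucket: str
    dump: str
    segment: str
    filename: str


# Assumption: this layout is inferred from the preview rows only;
# paths with a different structure are rejected rather than guessed at.
_PATH_RE = re.compile(
    r"^s3://(?P<bucket>[^/]+)/crawl-data/(?P<dump>CC-MAIN-\d{4}-\d{2})"
    r"/segments/(?P<segment>[^/]+)/warc/(?P<filename>[^/]+\.warc\.gz)$"
)


def parse_file_path(file_path: str) -> WarcLocation:
    """Split a file_path string into bucket, dump, segment, and file name."""
    match = _PATH_RE.match(file_path)
    if match is None:
        raise ValueError(f"unrecognized file_path layout: {file_path!r}")
    return WarcLocation(**match.groupdict())


# The file_path shared by all twelve preview rows.
EXAMPLE_PATH = (
    "s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/"
    "warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz"
)

if __name__ == "__main__":
    location = parse_file_path(EXAMPLE_PATH)
    # The dump embedded in the path should match the row's dump column.
    print(location.dump, location.segment)
```

In the preview, the dump recovered from the path matches the dump column of the same row, which is one simple consistency check such a helper makes possible.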

Auto-generated charts

Fineweb: 500 rows by 9 columns. These exploratory charts are generated automatically from the data - open the dataset in Helix to ask your own questions.

Rows: 500
Columns: 9
Numeric cols: 2
Categorical cols: 3
Date range: 2013-05-18 → 2013-05-18
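A date range like the one in this summary can be derived from the date column by parsing each timestamp and taking the earliest and latest calendar day. The following is a minimal sketch using only the Python standard library. It is not how Helix computes the figure, `parse_crawl_date` and `date_range` are hypothetical names, and the sample values are the dates of the first three preview rows.

```python
from datetime import datetime, timezone
from typing import Iterable, Tuple


def parse_crawl_date(value: str) -> datetime:
    """Parse a timestamp such as '2013-05-18T05:48:54Z' as UTC."""
    # strptime with a literal 'Z' also works on Python versions whose
    # datetime.fromisoformat does not accept the 'Z' suffix.
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def date_range(values: Iterable[str]) -> Tuple[str, str]:
    """Return the (earliest, latest) calendar days as ISO date strings."""
    parsed = [parse_crawl_date(v) for v in values]
    if not parsed:
        raise ValueError("no dates supplied")
    return min(parsed).date().isoformat(), max(parsed).date().isoformat()


# Dates taken from preview rows 1 to 3.
PREVIEW_DATES = [
    "2013-05-18T05:48:54Z",
    "2013-05-18T08:11:45Z",
    "2013-05-18T06:49:55Z",
]

if __name__ == "__main__":
    start, end = date_range(PREVIEW_DATES)
    print(f"{start} → {end}")
```

Because every preview timestamp falls on the same day, the start and end of the range coincide, which is consistent with the summary line.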

Charts

language_score over time

Trend of language_score, token_count across date.

language_score vs token_count

Relationship between language_score and token_count.

Distribution of language_score

Histogram of language_score values.

Columns

  • text: text
  • id: text
  • dump: categorical
  • url: text
  • date: datetime
  • file_path: categorical
  • language: categorical
  • language_score: numeric
  • token_count: numeric
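The column list above can be written down as a small mapping and tallied by kind, which reproduces the "Numeric cols" and "Categorical cols" counts in the summary. This is an illustrative sketch only: `COLUMN_KINDS`, `summarize_kinds`, and `columns_of_kind` are hypothetical names, and the kinds are copied from this page rather than from the parquet file's schema.

```python
from collections import Counter
from typing import Dict, List

# Column name -> kind, copied from the column list on this page.
COLUMN_KINDS: Dict[str, str] = {
    "text": "text",
    "id": "text",
    "dump": "categorical",
    "url": "text",
    "date": "datetime",
    "file_path": "categorical",
    "language": "categorical",
    "language_score": "numeric",
    "token_count": "numeric",
}


def summarize_kinds(kinds: Dict[str, str]) -> Counter:
    """Count how many columns fall under each kind."""
    return Counter(kinds.values())


def columns_of_kind(kinds: Dict[str, str], kind: str) -> List[str]:
    """List column names of one kind, in their original order."""
    return [name for name, value in kinds.items() if value == kind]


if __name__ == "__main__":
    summary = summarize_kinds(COLUMN_KINDS)
    print(len(COLUMN_KINDS), summary["numeric"], summary["categorical"])
    print(columns_of_kind(COLUMN_KINDS, "numeric"))
```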
