
Fineweb

Hugging Face

🍷 FineWeb
15 trillion tokens of the finest data the 🌐 web has to offer

What is it?

The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release…

See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.

File: huggingfacefw--fineweb.parquet (500 rows)
Source: HuggingFaceFW/fineweb


Columns

  • text (text)
  • id (text)
  • dump (categorical)
  • url (text)
  • date (datetime)
  • file_path (categorical)
  • language (categorical)
  • language_score (numeric)
  • token_count (numeric)
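
The column list above can be turned into a small typed record for checking rows before further processing. The following is a minimal sketch using only the Python standard library, not the dataset's official loader. The names `COLUMN_TYPES`, `FineWebRow`, and `parse_row` are hypothetical, and several choices are assumptions: `language_score` is treated as a float, `token_count` as an integer, and `date` as an ISO 8601 string. The sample values at the end are invented for illustration only.

```python
from dataclasses import dataclass
from datetime import datetime

# Column names and type labels exactly as they appear in the column list.
COLUMN_TYPES = {
    "text": "text",
    "id": "text",
    "dump": "categorical",
    "url": "text",
    "date": "datetime",
    "file_path": "categorical",
    "language": "categorical",
    "language_score": "numeric",
    "token_count": "numeric",
}


@dataclass
class FineWebRow:
    """One row of the sample (hypothetical helper type)."""
    text: str
    id: str
    dump: str
    url: str
    date: datetime
    file_path: str
    language: str
    language_score: float  # assumption: "numeric" means a fractional score
    token_count: int       # assumption: "numeric" means a whole-number count


def parse_row(raw: dict) -> FineWebRow:
    """Check that every listed column is present and coerce the values."""
    missing = [name for name in COLUMN_TYPES if name not in raw]
    if missing:
        raise KeyError(f"missing columns: {missing}")

    date = raw["date"]
    if isinstance(date, str):
        # Assumption: ISO 8601 text; a trailing "Z" is rewritten as "+00:00"
        # so that datetime.fromisoformat accepts it on older Python versions.
        date = datetime.fromisoformat(date.replace("Z", "+00:00"))

    return FineWebRow(
        text=str(raw["text"]),
        id=str(raw["id"]),
        dump=str(raw["dump"]),
        url=str(raw["url"]),
        date=date,
        file_path=str(raw["file_path"]),
        language=str(raw["language"]),
        language_score=float(raw["language_score"]),
        token_count=int(raw["token_count"]),
    )


# Invented sample values, for illustration only (not taken from the dataset).
example = parse_row({
    "text": "Hello, web.",
    "id": "example-id-0001",
    "dump": "EXAMPLE-DUMP",
    "url": "https://example.com/page",
    "date": "2020-01-02T03:04:05Z",
    "file_path": "example/path.warc.gz",
    "language": "en",
    "language_score": "0.97",
    "token_count": "3",
})
print(example.dump, example.date.year, example.token_count)
```

Keeping the type labels in one dictionary means the presence check follows the column list directly, so a row with a missing column fails early instead of surfacing later as an attribute error.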
