FineWeb
Hugging Face
🍷 FineWeb
15 trillion tokens of the finest data the 🌐 web has to offer
What is it?
The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release…
See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.
Columns
- text: text
- id: text
- dump: categorical
- url: text
- date: datetime
- file_path: categorical
- language: categorical
- language_score: numeric
- token_count: numeric
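Given the columns above, one kind of query worth trying is an aggregate over dump, filtered by language_score. The following is a minimal sketch, not the dataset's own tooling: it assumes an in-memory SQLite table named fineweb (a hypothetical name) with the listed columns, and the three sample rows are invented for illustration and are not taken from the dataset.

```python
import sqlite3

# Hypothetical sample rows, invented for illustration only; they are NOT
# real FineWeb records. Field order follows the column list above.
SAMPLE_ROWS = [
    ("first example document", "doc-1", "CC-MAIN-2024-10", "https://example.com/a",
     "2024-03-01T00:00:00Z", "path/to/file-a", "en", 0.97, 120),
    ("second example document", "doc-2", "CC-MAIN-2024-10", "https://example.com/b",
     "2024-03-02T00:00:00Z", "path/to/file-a", "en", 0.62, 300),
    ("third example document", "doc-3", "CC-MAIN-2023-50", "https://example.org/c",
     "2023-12-05T00:00:00Z", "path/to/file-b", "en", 0.91, 80),
]


def build_table(rows):
    """Create an in-memory table (assumed name: fineweb) and load the rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE fineweb (
               text TEXT, id TEXT, dump TEXT, url TEXT, date TEXT,
               file_path TEXT, language TEXT,
               language_score REAL, token_count INTEGER)"""
    )
    conn.executemany("INSERT INTO fineweb VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", rows)
    return conn


def tokens_per_dump(conn, min_score):
    """Count documents and sum tokens per dump, keeping rows at or above min_score."""
    query = """SELECT dump, COUNT(*) AS docs, SUM(token_count) AS tokens
               FROM fineweb
               WHERE language_score >= ?
               GROUP BY dump
               ORDER BY dump"""
    return conn.execute(query, (min_score,)).fetchall()


conn = build_table(SAMPLE_ROWS)
result = tokens_per_dump(conn, 0.9)
print(result)
```

The same SELECT should carry over to any SQL engine that exposes these columns; only the table name and the score threshold would need adjusting.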