Common Corpus

Hugging Face

Full paper - ICLR 2026 oral

Common Corpus is the largest open and permissively licensed text dataset, comprising 2.27 trillion tokens (2,267,302,720,836 tokens). It is a diverse dataset, consisting of books, newspapers, scientific articles, government and legal documents, code, and more. Common Corpus was created by Pleias in association with several partners. It differs from existing open datasets in that it is:

Truly Open: contains only data that…

See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/common_corpus.

PleIAs/common_corpus

Interesting queries to try

top 10 rows of Common Corpus with summary statistics
counts grouped by the most common field in Common Corpus
summary charts for Common Corpus
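
The first two queries above can be approximated locally. The following is a minimal sketch, not the page's own query engine: it assumes the Hugging Face `datasets` library is installed, that the dataset exposes a `train` split, and that streaming is acceptable given the dataset's size. The helper names (`first_rows`, `count_by_field`, `stream_common_corpus`) are hypothetical, and the field to group by must be chosen after inspecting a row, since no field names are given here.

```python
from collections import Counter
from itertools import islice


def first_rows(rows, n=10):
    """Return the first n rows of any iterable without consuming the rest."""
    return list(islice(rows, n))


def count_by_field(rows, field):
    """Count how often each value of `field` occurs; rows lacking it count as None."""
    return Counter(row.get(field) for row in rows)


def stream_common_corpus():
    """Open the dataset in streaming mode (needs `datasets` and network access)."""
    # Imported lazily so the two helpers above also work offline.
    from datasets import load_dataset

    # Assumption: the dataset has a "train" split. Streaming avoids
    # downloading the full corpus before reading the first rows.
    return load_dataset("PleIAs/common_corpus", split="train", streaming=True)
```

Calling `first_rows(stream_common_corpus())` would fetch the first ten rows; passing a larger sample to `count_by_field` with a field name taken from `sample[0].keys()` gives the grouped counts for that sample only, not for the whole corpus.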

Related datasets