The Stack Smol

Hugging Face

Dataset Description A small subset (~0.1%) of the-stack dataset, each programming language has 10,000 random samples from the original dataset. The dataset has 2.6GB of text (code). Languages The dataset contains 30 programming languages: "assembly", "batchfile", "c++", "c", "c-sharp", "cmake", "css", "dockerfile", "fortran", "go", "haskell", "html", "java", "javascript", "julia", "lua", "makefile", "markdown", "perl", "php", "powershell", "python", "ruby", "rust"… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack-smol.

bigcode/the-stack-smol

This dataset hasn't been imported yet, so it can't be charted here. You can browse it on Hugging Face.

Interesting queries to try

play_arrow top 10 rows of The Stack Smol with summary statistics
play_arrow counts grouped by the most common field in The Stack Smol
play_arrow summary charts for The Stack Smol

Interesting queries to try

Related datasets