Codeparrot Clean
CodeParrot 🦜 Dataset Cleaned

What is it?
A dataset of Python files from GitHub. This is the deduplicated version of the codeparrot dataset.

Processing
The original dataset contains a lot of duplicated and noisy data. Therefore, the dataset was cleaned with the following steps:
- Deduplication
  - Remove exact matches
- Filtering
  - Average line length < 100
  - Maximum line length < 1000
  - Alphanumeric character fraction > 0.25
  - Remove auto-generated files (keyword search)

For… See the full description on the dataset page: https://huggingface.co/datasets/codeparrot/codeparrot-clean.
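The cleaning steps described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the pipeline that produced the dataset: the thresholds come from the description, but the keyword list `AUTOGEN_KEYWORDS`, the use of SHA-256 for exact-match detection, and the definition of the alphanumeric fraction (alphanumeric characters divided by all characters) are assumptions, and the function names are hypothetical.

```python
import hashlib

# Thresholds taken from the dataset description.
MAX_LINE_MEAN = 100
MAX_LINE_MAX = 1000
MIN_ALPHA_FRAC = 0.25

# Hypothetical keyword list; the description does not say which keywords were used.
AUTOGEN_KEYWORDS = ("auto-generated", "autogenerated", "automatically generated")


def file_stats(content: str) -> dict:
    """Compute per-file statistics similar to the dataset's numeric columns."""
    lines = content.splitlines() or [""]
    lengths = [len(line) for line in lines]
    alnum = sum(ch.isalnum() for ch in content)
    lowered = content.lower()
    return {
        "line_mean": sum(lengths) / len(lengths),
        "line_max": max(lengths),
        # Assumed definition: alphanumeric characters over all characters.
        "alpha_frac": alnum / len(content) if content else 0.0,
        "autogenerated": any(k in lowered for k in AUTOGEN_KEYWORDS),
    }


def keep_file(content: str) -> bool:
    """Apply the four filtering rules listed in the description."""
    s = file_stats(content)
    return (
        s["line_mean"] < MAX_LINE_MEAN
        and s["line_max"] < MAX_LINE_MAX
        and s["alpha_frac"] > MIN_ALPHA_FRAC
        and not s["autogenerated"]
    )


def clean(files):
    """Drop exact duplicates first, then apply the filters."""
    seen = set()
    kept = []
    for content in files:
        # Assumption: exact matches are detected by hashing the full content.
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        if keep_file(content):
            kept.append(content)
    return kept
```

Calling `clean` on a list of file contents returns the unique files that pass every filter, in their original order.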
Columns
- repo_name text
- path text
- copies text
- size text
- content text
- license categorical
- hash numeric
- line_mean numeric
- line_max numeric
- alpha_frac numeric
- autogenerated bool
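As a rough illustration of how these columns might be used, the sketch below models one row as a dataclass and runs a simple query over rows already held in memory. The Python types are assumptions mapped from the listed column kinds (text, categorical, numeric, bool), and the helper `small_files_by_license` is hypothetical. Because `copies` and `size` are listed as text, the sketch parses `size` before comparing it numerically; the unit of `size` is not stated here.

```python
from dataclasses import dataclass


@dataclass
class Row:
    repo_name: str
    path: str
    copies: str        # listed as text; parse with int() before numeric use
    size: str          # listed as text; unit is not stated in the column list
    content: str
    license: str       # categorical
    hash: int          # numeric (assumed integer)
    line_mean: float
    line_max: int      # numeric (assumed integer)
    alpha_frac: float
    autogenerated: bool


def small_files_by_license(rows, license_name, max_size):
    """Return rows with the given license whose parsed size is at most max_size."""
    return [
        r for r in rows
        if r.license == license_name and int(r.size) <= max_size
    ]
```

A call such as `small_files_by_license(rows, "mit", 2000)` would then return only the matching rows, where the license value `"mit"` is itself an illustrative guess at how licenses are encoded.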