Codeparrot Clean

Hugging Face

CodeParrot 🦜 Dataset Cleaned

What is it?

A dataset of Python files from GitHub. This is the deduplicated version of the codeparrot dataset.

Processing

The original dataset contains a lot of duplicated and noisy data. Therefore, the dataset was cleaned with the following steps:

  • Deduplication
      • Remove exact matches
  • Filtering
      • Average line length < 100
      • Maximum line length < 1000
      • Alphanumeric character fraction > 0.25
      • Remove auto-generated files (keyword search)

For… See the full description on the dataset page: https://huggingface.co/datasets/codeparrot/codeparrot-clean.
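The cleaning steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline actually used to build the dataset: the MD5 digest for exact-match deduplication, the keyword list, the five-line scan window, and the helper names (`file_stats`, `looks_autogenerated`, `keep`, `clean`) are all hypothetical. Only the three numeric thresholds come from the description.

```python
import hashlib

# Thresholds taken from the processing description.
MAX_LINE_MEAN = 100     # average line length must be below this
MAX_LINE_MAX = 1000     # maximum line length must be below this
MIN_ALPHA_FRAC = 0.25   # alphanumeric fraction must be above this

# Assumption: the real keyword list is not given in the description.
AUTOGEN_KEYWORDS = ("auto-generated", "autogenerated", "automatically generated")


def file_stats(content):
    """Compute per-file statistics named like the dataset columns."""
    lines = content.splitlines() or [""]
    lengths = [len(line) for line in lines]
    alnum = sum(ch.isalnum() for ch in content)
    return {
        "line_mean": sum(lengths) / len(lengths),
        "line_max": max(lengths),
        "alpha_frac": alnum / len(content) if content else 0.0,
    }


def looks_autogenerated(content, scan_lines=5):
    """Keyword search over the first few lines (window size is an assumption)."""
    head = "\n".join(content.splitlines()[:scan_lines]).lower()
    return any(keyword in head for keyword in AUTOGEN_KEYWORDS)


def keep(content):
    """Apply the four filters from the description to one file."""
    stats = file_stats(content)
    return (
        stats["line_mean"] < MAX_LINE_MEAN
        and stats["line_max"] < MAX_LINE_MAX
        and stats["alpha_frac"] > MIN_ALPHA_FRAC
        and not looks_autogenerated(content)
    )


def clean(files):
    """Drop exact duplicates first, then apply the filters."""
    seen = set()
    kept = []
    for content in files:
        # Assumption: exact matches are detected by hashing the full content.
        digest = hashlib.md5(content.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        if keep(content):
            kept.append(content)
    return kept
```

Calling `clean` on a list of file contents returns the files that survive both steps, in their original order. Deduplicating before filtering mirrors the order in which the steps are listed; the description does not say whether the order matters.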

File: codeparrot--codeparrot-clean.parquet; 500 rows; source: codeparrot/codeparrot-clean
Columns

  • repo_name text
  • path text
  • copies text
  • size text
  • content text
  • license categorical
  • hash numeric
  • line_mean numeric
  • line_max numeric
  • alpha_frac numeric
  • autogenerated bool
