Codeparrot Clean
Hugging Face

CodeParrot 🦜 Dataset Cleaned

What is it?
A dataset of Python files from GitHub. This is the deduplicated version of the codeparrot dataset.

Processing
The original dataset contains a lot of duplicated and noisy data. Therefore, the dataset was cleaned with the following steps:

- Deduplication
  - Remove exact matches
- Filtering
  - Average line length < 100
  - Maximum line length < 1000
  - Alphanumeric character fraction > 0.25
  - Remove auto-generated files (keyword search)

For… See the full description on the dataset page: https://huggingface.co/datasets/codeparrot/codeparrot-clean.
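The cleaning steps described above (exact-match deduplication followed by the line-length, alphanumeric-fraction, and keyword filters) can be sketched roughly as follows. This is a minimal illustration, not the dataset authors' actual pipeline: the function names, the keyword list in `AUTOGEN_KEYWORDS`, the number of header lines scanned, and the exact way the statistics are computed are all assumptions made for the example.

```python
import hashlib

# Hypothetical keyword list: the summary above only says "keyword search",
# so these strings are illustrative assumptions.
AUTOGEN_KEYWORDS = ("auto-generated", "autogenerated", "automatically generated")


def file_stats(content: str) -> dict:
    """Compute per-file statistics analogous to line_mean, line_max, alpha_frac."""
    lines = content.splitlines() or [""]
    lengths = [len(line) for line in lines]
    alnum = sum(ch.isalnum() for ch in content)
    return {
        "line_mean": sum(lengths) / len(lengths),
        "line_max": max(lengths),
        "alpha_frac": alnum / len(content) if content else 0.0,
    }


def looks_autogenerated(content: str, scan_lines: int = 5) -> bool:
    """Keyword search over the first few lines (scan_lines is an assumption)."""
    head = "\n".join(content.splitlines()[:scan_lines]).lower()
    return any(keyword in head for keyword in AUTOGEN_KEYWORDS)


def keep_file(content: str) -> bool:
    """Apply the thresholds listed in the processing description."""
    stats = file_stats(content)
    return (
        stats["line_mean"] < 100
        and stats["line_max"] < 1000
        and stats["alpha_frac"] > 0.25
        and not looks_autogenerated(content)
    )


def clean(files):
    """Drop exact duplicates (by content hash), then apply the filters."""
    seen = set()
    kept = []
    for content in files:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact match of a file already seen
        seen.add(digest)
        if keep_file(content):
            kept.append(content)
    return kept
```

Hashing the full file content only catches byte-for-byte duplicates, which matches the "Remove exact matches" step; near-duplicate detection would need a different technique.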
Data preview
500 rows · 11 columns · showing first 12

| # | repo_name text | path text | copies text | size text | content text | license text | hash integer | line_mean float | line_max integer | alpha_frac float | autogenerated boolean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | ahmedbodi/AutobahnPython | examples/asyncio/websocket/echo/client_coroutines.py | 13 | 2044 | ############################################################################### ## ## Copyright (C) 2013-2014 Tavendo GmbH ## ## Licensed… | apache-2.0 | 7822061744094950801 | 31.44 | 79 | 0.6233 | False |
| 2 | ifduyue/django | django/core/checks/registry.py | 13 | 3108 | from itertools import chain from django.utils.itercompat import is_iterable class Tags: """ Built-in tags for internal checks. … | bsd-3-clause | -2035686896372967697 | 30.71 | 91 | 0.6023 | False |
| 3 | kmike/scikit-learn | sklearn/utils/__init__.py | 3 | 10094 | """ The :mod:`sklearn.utils` module includes various utilites. """ from collections import Sequence import numpy as np from scipy.sparse … | bsd-3-clause | 2334709577611160651 | 26.88 | 79 | 0.5681 | False |
| 4 | houlixin/BBB-TISDK | linux-devkit/sysroots/i686-arago-linux/usr/lib/python2.7/encodings/cp1250.py | 593 | 13942 | """ Python Character Mapping Codec cp1250 generated from 'MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1250.TXT' with gencodec.py. """#" import code… | gpl-2.0 | -6356832018515182181 | 44.41 | 119 | 0.5504 | False |
| 5 | dataxu/ansible | lib/ansible/modules/system/kernel_blacklist.py | 125 | 4009 | #!/usr/bin/python # encoding: utf-8 -*- # Copyright: (c) 2013, Matthias Vogelgesang <[email protected]> # GNU General Public … | gpl-3.0 | 8498771084445726761 | 24.86 | 92 | 0.5752 | False |
| 6 | 163gal/Time-Line | libs_arm/wx/_controls.py | 2 | 332374 | # This file was created automatically by SWIG 1.3.29. # Don't modify this file, modify the SWIG interface instead. import _controls_ impor… | gpl-3.0 | -3839353247363315950 | 41.47 | 151 | 0.655 | False |
| 7 | blackbliss/callme | flask/lib/python2.7/site-packages/werkzeug/contrib/cache.py | 306 | 23519 | # -*- coding: utf-8 -*- """ werkzeug.contrib.cache ~~~~~~~~~~~~~~~~~~~~~~ The main problem with dynamic Web sites is, well, th… | mit | -7111811701270589600 | 33.64 | 86 | 0.5791 | False |
| 8 | pipet/pipet | pipet/sources/zendesk/tasks.py | 2 | 1544 | from contextlib import contextmanager from datetime import datetime from inspect import isclass from celery import chord, group from celer… | apache-2.0 | 156606072465059935 | 28.69 | 96 | 0.6457 | False |
| 9 | tomchristie/django | django/apps/config.py | 55 | 8047 | import os from importlib import import_module from django.core.exceptions import ImproperlyConfigured from django.utils.module_loading imp… | bsd-3-clause | -8530773113433397750 | 38.64 | 81 | 0.5923 | False |
| 10 | prutseltje/ansible | test/units/modules/network/f5/test_bigip_gtm_datacenter.py | 23 | 6819 | # -*- coding: utf-8 -*- # # Copyright (c) 2017 F5 Networks Inc. # GNU General Public License v3.0 (see COPYING or https://www.gnu.org/licen… | gpl-3.0 | 1435203684349960277 | 30.72 | 91 | 0.6287 | False |
| 11 | antb/TPT----My-old-mod | src/python/stdlib/ctypes/test/test_errno.py | 115 | 2330 | import unittest, os, errno from ctypes import * from ctypes.util import find_library from test import test_support try: import threadin… | gpl-2.0 | 785952512028991401 | 28.12 | 69 | 0.5541 | False |
| 12 | Sarah-Alsinan/muypicky | lib/python3.6/site-packages/django/db/backends/sqlite3/creation.py | 60 | 4965 | import os import shutil import sys from django.core.exceptions import ImproperlyConfigured from django.db.backends.base.creation import Ba… | mit | 3775330401011206918 | 42.94 | 97 | 0.5766 | False |
Auto-generated charts
Codeparrot Clean: 500 rows by 11 columns. These exploratory charts are generated automatically from the data.
Total line_mean by license
Top license values ranked by summed line_mean.
line_mean vs line_max
Relationship between line_mean and line_max.
Distribution of line_mean
Histogram of line_mean values.
Correlation of numeric columns
Pearson correlation between numeric columns.
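Two of the chart descriptions above (summed line_mean per license, and the Pearson correlation between numeric columns) can be reproduced by hand. The sketch below uses only the 12 preview rows shown in the table, so its numbers will differ from charts built on all 500 rows; the names `PREVIEW_ROWS`, `total_line_mean_by_license`, and `pearson` are hypothetical and chosen for this example.

```python
import math
from collections import defaultdict

# (license, line_mean, line_max) copied from the 12 preview rows above.
PREVIEW_ROWS = [
    ("apache-2.0", 31.44, 79),
    ("bsd-3-clause", 30.71, 91),
    ("bsd-3-clause", 26.88, 79),
    ("gpl-2.0", 44.41, 119),
    ("gpl-3.0", 24.86, 92),
    ("gpl-3.0", 41.47, 151),
    ("mit", 33.64, 86),
    ("apache-2.0", 28.69, 96),
    ("bsd-3-clause", 38.64, 81),
    ("gpl-3.0", 30.72, 91),
    ("gpl-2.0", 28.12, 69),
    ("mit", 42.94, 97),
]


def total_line_mean_by_license(rows):
    """Sum line_mean per license and rank licenses by that total, descending."""
    totals = defaultdict(float)
    for license_name, line_mean, _line_max in rows:
        totals[license_name] += line_mean
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


ranked = total_line_mean_by_license(PREVIEW_ROWS)
r = pearson(
    [row[1] for row in PREVIEW_ROWS],
    [row[2] for row in PREVIEW_ROWS],
)

for license_name, total in ranked:
    print(f"{license_name}: {total:.2f}")
print(f"Pearson r (line_mean vs line_max): {r:.3f}")
```

Note that a summed line_mean mostly reflects how many files carry each license, so a per-license average may be the more informative statistic when comparing licenses.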
Columns
- repo_name text
- path text
- copies text
- size text
- content text
- license categorical
- hash numeric
- line_mean numeric
- line_max numeric
- alpha_frac numeric
- autogenerated bool