
Codeparrot Clean

Hugging Face

CodeParrot 🦜 Dataset Cleaned

What is it?

A dataset of Python files from GitHub. This is the deduplicated version of the codeparrot dataset.

Processing

The original dataset contains a lot of duplicated and noisy data. Therefore, the dataset was cleaned with the following steps:

Deduplication
  • Remove exact matches

Filtering
  • Average line length < 100
  • Maximum line length < 1000
  • Alphanumeric character fraction > 0.25
  • Remove auto-generated files (keyword search)

For… See the full description on the dataset page: https://huggingface.co/datasets/codeparrot/codeparrot-clean.
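The cleaning steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline the dataset authors ran: the function names (`file_stats`, `passes_filters`, `clean`) are hypothetical, exact-match deduplication is approximated with a SHA-256 digest of the file content, and the auto-generated keyword search is left out because the keywords are not listed here.

```python
import hashlib


def file_stats(content: str) -> dict:
    """Compute the per-file statistics used by the filters (assumed definitions)."""
    lines = content.splitlines() or [""]
    lengths = [len(line) for line in lines]
    alnum = sum(ch.isalnum() for ch in content)
    return {
        "line_mean": sum(lengths) / len(lengths),
        "line_max": max(lengths),
        # Fraction of alphanumeric characters over all characters in the file.
        "alpha_frac": alnum / len(content) if content else 0.0,
    }


def passes_filters(stats: dict) -> bool:
    """Apply the thresholds from the dataset description."""
    return (
        stats["line_mean"] < 100
        and stats["line_max"] < 1000
        and stats["alpha_frac"] > 0.25
    )


def clean(files):
    """Drop exact duplicates, then keep only files that pass the filters."""
    seen = set()
    kept = []
    for content in files:
        # Assumption: identical content is detected by hashing the raw text.
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        if passes_filters(file_stats(content)):
            kept.append(content)
    return kept
```

Passing a list of file contents to `clean` returns the first copy of each distinct file that satisfies all three thresholds. The resulting statistics correspond by name to the line_mean, line_max, and alpha_frac columns of the dataset, though the exact formulas behind those columns are assumed.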

File: codeparrot--codeparrot-clean.parquet · 500 rows · Source: codeparrot/codeparrot-clean

Data preview

500 rows · 11 columns · showing first 12
Columns: # · repo_name (text) · path (text) · copies (text) · size (text) · content (text) · license (text) · hash (integer) · line_mean (float) · line_max (integer) · alpha_frac (float) · autogenerated (boolean)
1 ahmedbodi/AutobahnPythonexamples/asyncio/websocket/echo/client_coroutines.py132044############################################################################### ## ## Copyright (C) 2013-2014 Tavendo GmbH ## ## Licensed…apache-2.0782206174409495080131.44790.6233False
2 ifduyue/djangodjango/core/checks/registry.py133108from itertools import chain from django.utils.itercompat import is_iterable class Tags: """ Built-in tags for internal checks. …bsd-3-clause-203568689637296769730.71910.6023False
3 kmike/scikit-learnsklearn/utils/__init__.py310094""" The :mod:`sklearn.utils` module includes various utilites. """ from collections import Sequence import numpy as np from scipy.sparse …bsd-3-clause233470957761116065126.88790.5681False
4 houlixin/BBB-TISDKlinux-devkit/sysroots/i686-arago-linux/usr/lib/python2.7/encodings/cp1250.py59313942""" Python Character Mapping Codec cp1250 generated from 'MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1250.TXT' with gencodec.py. """#" import code…gpl-2.0-635683201851518218144.411190.5504False
5 dataxu/ansiblelib/ansible/modules/system/kernel_blacklist.py1254009#!/usr/bin/python # encoding: utf-8 -*- # Copyright: (c) 2013, Matthias Vogelgesang <[email protected]> # GNU General Public …gpl-3.0849877108444572676124.86920.5752False
6 163gal/Time-Linelibs_arm/wx/_controls.py2332374# This file was created automatically by SWIG 1.3.29. # Don't modify this file, modify the SWIG interface instead. import _controls_ impor…gpl-3.0-383935324736331595041.471510.655False
7 blackbliss/callmeflask/lib/python2.7/site-packages/werkzeug/contrib/cache.py30623519# -*- coding: utf-8 -*- """ werkzeug.contrib.cache ~~~~~~~~~~~~~~~~~~~~~~ The main problem with dynamic Web sites is, well, th…mit-711181170127058960033.64860.5791False
8 pipet/pipetpipet/sources/zendesk/tasks.py21544from contextlib import contextmanager from datetime import datetime from inspect import isclass from celery import chord, group from celer…apache-2.015660607246505993528.69960.6457False
9 tomchristie/djangodjango/apps/config.py558047import os from importlib import import_module from django.core.exceptions import ImproperlyConfigured from django.utils.module_loading imp…bsd-3-clause-853077311343339775038.64810.5923False
10 prutseltje/ansibletest/units/modules/network/f5/test_bigip_gtm_datacenter.py236819# -*- coding: utf-8 -*- # # Copyright (c) 2017 F5 Networks Inc. # GNU General Public License v3.0 (see COPYING or https://www.gnu.org/licen…gpl-3.0143520368434996027730.72910.6287False
11 antb/TPT----My-old-modsrc/python/stdlib/ctypes/test/test_errno.py1152330import unittest, os, errno from ctypes import * from ctypes.util import find_library from test import test_support try: import threadin…gpl-2.078595251202899140128.12690.5541False
12 Sarah-Alsinan/muypickylib/python3.6/site-packages/django/db/backends/sqlite3/creation.py604965import os import shutil import sys from django.core.exceptions import ImproperlyConfigured from django.db.backends.base.creation import Ba…mit377533040101120691842.94970.5766False

Auto-generated charts

Codeparrot Clean: 500 rows by 11 columns. These exploratory charts are generated automatically from the data - open the dataset in Helix to ask your own questions.

Rows: 500
Columns: 11
Numeric cols: 3
Categorical cols: 2

Charts

Total line_mean by license

Top license values ranked by summed line_mean.

line_mean vs line_max

Relationship between line_mean and line_max.

Distribution of line_mean

Histogram of line_mean values.

Correlation of numeric columns

Pearson correlation between numeric columns.
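For readers who want to reproduce two of these summaries outside Helix, the following is a minimal sketch of the "total line_mean by license" ranking and the Pearson correlation. The helper names and the row layout (a list of dicts keyed by column name) are assumptions made for illustration, and the sample rows in the usage note are toy values, not rows from the dataset.

```python
from collections import defaultdict
from math import sqrt


def total_by_license(rows, value_col="line_mean"):
    """Sum a numeric column per license and rank licenses by that total."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["license"]] += row[value_col]
    # Largest total first, matching a "top values" bar chart.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

A possible usage, with toy rows standing in for the real data:

```python
rows = [
    {"license": "mit", "line_mean": 10.0, "line_max": 20},
    {"license": "mit", "line_mean": 20.0, "line_max": 40},
    {"license": "gpl-3.0", "line_mean": 25.0, "line_max": 50},
]
ranking = total_by_license(rows)
r = pearson([row["line_mean"] for row in rows], [row["line_max"] for row in rows])
```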

Columns

  • repo_name text
  • path text
  • copies text
  • size text
  • content text
  • license categorical
  • hash numeric
  • line_mean numeric
  • line_max numeric
  • alpha_frac numeric
  • autogenerated bool
