Datasets
Browse curated open datasets. Click one to see example queries and open it in Helix.
Carshare
Carshare dataset: build charts and dashboards with AI
- “show carshare density across the map”
- “which hours have the most car hours”
Election
Election dataset: build charts and dashboards with AI
- “who won the most votes by district”
- “show vote share by candidate as a stacked bar”
Tips
Tips dataset: build charts and dashboards with AI
- “average tip percentage by day of the week”
- “compare tip amounts for smokers vs non-smokers”
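A prompt like “average tip percentage by day of the week” reduces to a per-day groupby. A minimal pandas sketch on an inline sample that assumes the classic tips schema (total_bill, tip, day); the real dataset's column names may differ:

```python
import pandas as pd

# Tiny inline sample assuming the classic tips schema (total_bill, tip, day).
tips = pd.DataFrame({
    "total_bill": [10.00, 20.00, 30.00, 40.00],
    "tip":        [2.00, 3.00, 6.00, 4.00],
    "day":        ["Thur", "Thur", "Fri", "Fri"],
})

# Tip percentage per row, then the mean per day.
tips["tip_pct"] = tips["tip"] / tips["total_bill"] * 100
avg_by_day = tips.groupby("day")["tip_pct"].mean()
print(avg_by_day)
```

The same groupby-then-aggregate shape covers most “average X by Y” prompts in this catalog.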
Gapminder
Gapminder dataset: build instant charts and dashboards with AI using global development data
- “life expectancy over time by continent”
- “GDP per capita vs life expectancy as a bubble chart”
Iris
Iris dataset: build charts and dashboards with AI using the iconic flower measurements
- “scatter petal length vs petal width coloured by species”
- “boxplot sepal width by species”
Medals_long
Medals dataset: build charts and dashboards with AI
- “total medals per country”
- “gold medal ratio by country”
Stocks
Stocks dataset: build charts and dashboards with AI
- “price over time for each ticker”
- “rolling 30 day returns per ticker”
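A prompt like “rolling 30 day returns per ticker” combines a per-ticker percentage change with a rolling window. A hedged sketch on made-up long-format data (date, ticker, and price are assumed column names; the window is shortened to 3 for the toy sample, use 30 on real data):

```python
import pandas as pd

# Made-up long-format sample assuming (date, ticker, price) columns.
stocks = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=5).tolist() * 2,
    "ticker": ["AAPL"] * 5 + ["GOOG"] * 5,
    "price": [100, 102, 101, 103, 106, 50, 51, 53, 52, 54],
})

stocks = stocks.sort_values(["ticker", "date"])
# Daily returns per ticker, then a rolling mean (window=3 here, 30 on real data).
stocks["ret"] = stocks.groupby("ticker")["price"].pct_change()
stocks["rolling_ret"] = (
    stocks.groupby("ticker")["ret"].transform(lambda s: s.rolling(3).mean())
)
print(stocks.tail())
```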
Experiments
Experiment dataset: build charts and dashboards with AI
- “mean outcome by experimental group”
- “distribution of outcomes by condition”
Wind
Wind speed dataset: build polar-coordinate and directional charts with AI
- “polar plot of wind direction weighted by speed”
- “wind speed distribution as a histogram”
Iris
Iris Species Dataset The Iris dataset was used in R.A. Fisher's classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems, and can also be found on the UCI Machine Learning Repository. It includes three iris species with 50 samples each as well as some properties about each flower. One flower species is linearly separable from the other two, but the other two are not linearly separable from each other. The dataset is taken from UCI Machine Learning Repository's… See the full description on the dataset page: https://huggingface.co/datasets/scikit-learn/iris.
- “average SepalLengthCm by Species”
- “scatter SepalLengthCm vs SepalWidthCm”
Adult Census Income
Adult Census Income Dataset The following was retrieved from UCI machine learning repository. This data was extracted from the 1994 Census bureau database by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)). The prediction task is to determine whether a person makes over $50K a year. Description of fnlwgt (final weight)… See the full description on the dataset page: https://huggingface.co/datasets/scikit-learn/adult-census-income.
- “average age by workclass”
- “scatter age vs fnlwgt”
Breast Cancer Wisconsin
Breast Cancer Wisconsin Diagnostic Dataset Following description was retrieved from breast cancer dataset on UCI machine learning repository. Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. A few of the images can be found at here. Separating plane described above was obtained using Multisurface Method-Tree (MSM-T), a classification method which uses linear… See the full description on the dataset page: https://huggingface.co/datasets/scikit-learn/breast-cancer-wisconsin.
- “average radius_mean by diagnosis”
- “scatter radius_mean vs texture_mean”
Heart
Heart The Heart dataset from the UCI ML repository. Does the patient have heart disease? Configurations and tasks Configuration Task hungary Binary classification Usage from datasets import load_dataset dataset = load_dataset("mstz/heart", "hungary")["train"]
- “age distribution by heart disease status”
- “average cholesterol by heart disease status”
Adult
Adult The Adult dataset from the UCI ML repository. Census dataset including personal characteristics of a person and their income threshold. Configurations and tasks Configuration Task Description income Binary classification Classify the person's income as over or under the threshold. income-no race Binary classification As income, but the race feature is removed. race Multiclass classification Predict the race of the individual. Usage… See the full description on the dataset page: https://huggingface.co/datasets/mstz/adult.
- “average age by marital_status”
- “scatter age vs capital_gain”
Wine
Wine The Wine dataset from Kaggle. Classify wine as red or white. Configurations and tasks Configuration Task Description wine Binary classification Is this red wine? Usage from datasets import load_dataset dataset = load_dataset("mstz/wine")["train"]
- “scatter fixed_acidity vs volatile_acidity”
- “correlation heatmap of all numeric columns”
Titanic
Titanic The Titanic dataset from Kaggle. Configurations and tasks Configuration Task Description survival Binary classification Has the passenger survived? Usage from datasets import load_dataset dataset = load_dataset("mstz/titanic")["train"]
- “summary charts for the Titanic dataset”
- “top 10 rows of Titanic with key statistics”
Abalone
Abalone The Abalone dataset from the UCI ML repository. Predict the age of the given abalone. Configurations and tasks Configuration Task Description abalone Regression Predict the age of the abalone. binary Binary classification Does the abalone have more than 9 rings? Usage from datasets import load_dataset dataset = load_dataset("mstz/abalone")["train"] Features Target feature in bold. Feature Type sex [string]… See the full description on the dataset page: https://huggingface.co/datasets/mstz/abalone.
- “average length by sex”
- “scatter length vs diameter”
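The card's binary configuration asks whether an abalone has more than 9 rings. A small sketch, assuming the UCI abalone column names (sex, length, diameter, rings), that derives that binary target and answers “average length by sex” on inline sample rows:

```python
import pandas as pd

# Inline sample assuming the UCI abalone schema (sex, length, diameter, rings).
abalone = pd.DataFrame({
    "sex":      ["M", "F", "I", "M"],
    "length":   [0.455, 0.530, 0.330, 0.600],
    "diameter": [0.365, 0.420, 0.255, 0.480],
    "rings":    [15, 9, 7, 12],
})

# Binary target used by the card's "binary" configuration: more than 9 rings?
abalone["over_9_rings"] = abalone["rings"] > 9
avg_length = abalone.groupby("sex")["length"].mean()
print(avg_length)
print(abalone["over_9_rings"].mean())  # share of abalone over the threshold
```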
Car
Car The Car dataset from the UCI repository. Classify the acceptability level of a car for resale. Configurations and tasks Configuration Task Description car Multiclass classification What is the acceptability level of the car? car_binary Binary classification Is the car acceptable? Usage from datasets import load_dataset dataset = load_dataset("mstz/car", "car_binary")["train"]
- “scatter buying vs maint”
- “correlation heatmap of all numeric columns”
Mushroom
Mushroom The Mushroom dataset from the UCI ML repository. Configurations and tasks Configuration Task Description mushroom Binary classification Is the mushroom poisonous? Usage from datasets import load_dataset dataset = load_dataset("mstz/mushroom")["train"]
- “summary charts for the Mushroom dataset”
- “top 10 rows of Mushroom with key statistics”
Glass
Glass The Glass dataset from the UCI repository. Classify the type of glass. Configurations and tasks Configuration Task Description glass Multiclass classification Classify glass type. windows Binary classification Is this windows glass? vehicles Binary classification Is this vehicles glass? containers Binary classification Is this containers glass? tableware Binary classification Is this tableware glass? headlamps Binary classification Is this… See the full description on the dataset page: https://huggingface.co/datasets/mstz/glass.
- “scatter refractive_index vs sodium”
- “correlation heatmap of all numeric columns”
Demo1
Dataset Card for Demo1 Dataset Summary This is a demo dataset. It consists of two files data/train.csv and data/test.csv You can load it with from datasets import load_dataset demo1 = load_dataset("lhoestq/demo1") Supported Tasks and Leaderboards [More Information Needed] Languages [More Information Needed] Dataset Structure Data Instances [More Information Needed] Data Fields [More Information Needed]… See the full description on the dataset page: https://huggingface.co/datasets/lhoestq/demo1.
- “show star over date as a line chart”
- “average star by package_name”
Custom Squad
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
- “answer length distribution”
- “most common question types”
Reddit Finance 43 250K
reddit finance 43 250k reddit_finance_43_250k is a collection of 250k post/comment pairs from 43 financial, investing and crypto subreddits. Posts must all have been text, with a length of 250 characters, and a positive score. Each subreddit is narrowed down to the 70th quantile before being merged with its top 3 comments and then with the other subreddits. Further score-based methods are used to select the top 250k post/comment pairs. The code to recreate the dataset is here:… See the full description on the dataset page: https://huggingface.co/datasets/winddude/reddit_finance_43_250k.
- “scatter z_score vs normalized_score”
- “correlation heatmap of all numeric columns”
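The description's “narrowed down to the 70th quantile” step can be sketched as a per-subreddit score filter. A hedged pandas example on made-up rows (subreddit and score are assumed column names; one plausible reading of the filtering, not the dataset's actual build code):

```python
import pandas as pd

# Made-up sample assuming (subreddit, score) columns as the card describes.
posts = pd.DataFrame({
    "subreddit": ["stocks"] * 5 + ["crypto"] * 5,
    "score": [1, 2, 3, 4, 100, 10, 20, 30, 40, 50],
})

# Keep only posts at or above their subreddit's 70th-percentile score --
# one way to implement "narrowed down to the 70th quantile".
cutoff = posts.groupby("subreddit")["score"].transform(lambda s: s.quantile(0.7))
top_posts = posts[posts["score"] >= cutoff]
print(top_posts)
```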
Finance Tasks
Adapting LLMs to Domains via Continual Pre-Training (ICLR 2024) This repo contains the evaluation datasets for our paper Adapting Large Language Models via Reading Comprehension. We explore continued pre-training on domain-specific corpora for large language models. While this approach enriches LLMs with domain knowledge, it significantly hurts their prompting ability for question answering. Inspired by human learning via reading comprehension, we propose a simple method to… See the full description on the dataset page: https://huggingface.co/datasets/AdaptLLM/finance-tasks.
- “answer length distribution”
- “most common question types”
Financial Classification
Dataset Creation This dataset combines financial phrasebank dataset and a financial text dataset from Kaggle. Given the financial phrasebank dataset does not have a validation split, I thought this might help to validate finance models and also capture the impact of COVID on financial earnings with the more recent Kaggle dataset.
- “most common values in text”
- “default rates by segment”
Ml Arxiv Papers
This dataset contains the subset of ArXiv papers with the "cs.LG" tag to indicate the paper is about Machine Learning. The core dataset is filtered from the full ArXiv dataset hosted on Kaggle: https://www.kaggle.com/datasets/Cornell-University/arxiv. The original dataset contains roughly 2 million papers. This dataset contains roughly 100,000 papers following the category filtering. The dataset is maintained with requests to the ArXiv API. The current iteration of the dataset only contains… See the full description on the dataset page: https://huggingface.co/datasets/CShorten/ML-ArXiv-Papers.
- “distribution of abstract lengths”
- “most common values in title”
Ai Arxiv Chunked
Hugging Face dataset: jamescalam/ai-arxiv-chunked
- “distribution of doi”
- “most common values in chunk-id”
Github Code Clean
The GitHub Code Clean dataset is a more filtered version of the codeparrot/github-code dataset. It consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totaling almost 1 TB of text data.
- “files per language”
- “distribution of file sizes”
Github Jupyter Code To Text
Dataset description This dataset consists of sequences of Python code followed by a docstring explaining its function. It was constructed by concatenating code and text pairs from this dataset that were originally code and markdown cells in Jupyter Notebooks. The content of each example is the following: [CODE] """ Explanation: [TEXT] End of explanation """ [CODE] """ Explanation: [TEXT] End of explanation """ ... How to use it from datasets import load_dataset ds =… See the full description on the dataset page: https://huggingface.co/datasets/codeparrot/github-jupyter-code-to-text.
- “most common values in repo_name”
- “distribution of content lengths”
The Stack Smol
Dataset Description A small subset (~0.1%) of the-stack dataset, each programming language has 10,000 random samples from the original dataset. The dataset has 2.6GB of text (code). Languages The dataset contains 30 programming languages: "assembly", "batchfile", "c++", "c", "c-sharp", "cmake", "css", "dockerfile", "fortran", "go", "haskell", "html", "java", "javascript", "julia", "lua", "makefile", "markdown", "perl", "php", "powershell", "python", "ruby", "rust"… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack-smol.
- “samples per language”
- “distribution of file sizes”
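The subsetting the card describes, a fixed number of random samples per programming language, can be sketched with pandas' per-group sampling (n=2 on a toy frame instead of the real 10,000; lang and content are assumed column names):

```python
import pandas as pd

# Inline stand-in for a code dataset with a language column; the real subset
# draws 10,000 samples per language, we draw 2 here.
files = pd.DataFrame({
    "lang": ["python"] * 5 + ["rust"] * 5,
    "content": [f"code_{i}" for i in range(10)],
})

# Fixed-size random sample per language (stratified subsetting).
smol = files.groupby("lang", group_keys=False).sample(n=2, random_state=0)
print(smol["lang"].value_counts())
```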
Apps
APPS is a benchmark for Python code generation. It includes 10,000 problems, ranging from simple one-line solutions to substantial algorithmic challenges. For more details, please refer to this paper: https://arxiv.org/pdf/2105.09938.pdf.
- “problems per difficulty level”
- “distribution of solution lengths”
World Cities Geo
Dataset containing city, country, region, and continent alongside their longitude and latitude coordinates. Cartesian coordinates are provided in the x, y, and z features.
- “average latitude by country”
- “scatter latitude vs longitude”
Channel Metadata
Dataset containing video metadata from a few tech channels, i.e. James Briggs, Yannic Kilcher, sentdex, Daniel Bourke, AI Coffee Break with Letitia, and Alex Ziskind
- “show Like Count over Time Created as a line chart”
- “average Like Count by Channel ID”
Youtube Transcriptions
The YouTube transcriptions dataset contains technical tutorials (currently from James Briggs, Daniel Bourke, and AI Coffee Break) transcribed using OpenAI's Whisper (large). Each row represents roughly a sentence-length chunk of text alongside the video URL and timestamp. Note that each item in the dataset contains just a short chunk of text. For most use cases you will likely need to merge multiple rows to create more substantial chunks of text, if you need to do that, this code snippet will… See the full description on the dataset page: https://huggingface.co/datasets/jamescalam/youtube-transcriptions.
- “average start by title”
- “scatter start vs end”
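The card notes that each row is only a sentence-length chunk and that most use cases need larger merged passages. A hedged sketch of merging consecutive chunks per video, assuming columns (url, start, end, text); the actual snippet referenced on the dataset page may differ:

```python
import pandas as pd

# Assumed schema: one sentence-length chunk per row with (url, start, end, text).
chunks = pd.DataFrame({
    "url":   ["v1", "v1", "v1", "v2"],
    "start": [0.0, 4.2, 9.0, 0.0],
    "end":   [4.2, 9.0, 12.5, 3.1],
    "text":  ["hello and", "welcome to", "the channel", "today we"],
})

# Merge every `size` consecutive chunks within each video into one passage,
# keeping the start of the first chunk and the end of the last.
def merge_chunks(group, size=2):
    out = []
    for i in range(0, len(group), size):
        window = group.iloc[i:i + size]
        out.append({
            "url": window["url"].iloc[0],
            "start": window["start"].iloc[0],
            "end": window["end"].iloc[-1],
            "text": " ".join(window["text"]),
        })
    return out

merged = [row for _, g in chunks.groupby("url", sort=False) for row in merge_chunks(g)]
print(merged)
```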
Codealpaca 20K
This dataset splits the original CodeAlpaca dataset into train and test splits.
- “most common values in prompt”
- “prompt length distribution”
Medical Meadow Medqa
Dataset Card for MedQA Dataset Summary This is the data and baseline source code for the paper: Jin, Di, et al. "What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams." From https://github.com/jind11/MedQA: The data that contains both the QAs and textbooks can be downloaded from this google drive folder. A bit of details of data are explained as below: For QAs, we have three sources: US, Mainland of China, and… See the full description on the dataset page: https://huggingface.co/datasets/medalpaca/medical_meadow_medqa.
- “distribution of instruction”
- “most common values in input”
Wiki Medical Terms
Dataset Card for wiki_medical_terms Dataset Summary This dataset contains over 6,000 medical terms and their Wikipedia text. It is intended to be used on a downstream task that requires medical terms and their Wikipedia explanation. Dataset Structure Data Instances [More Information Needed] Data Fields [More Information Needed] Data Splits [More Information Needed] Dataset Creation Curation Rationale [More… See the full description on the dataset page: https://huggingface.co/datasets/gamino/wiki_medical_terms.
- “most common values in page_title”
- “distribution of page_text lengths”
Medical Qa Datasets
all-processed dataset is a concatenation of the medical-meadow-* and chatdoctor_healthcaremagic datasets. The Chat Doctor term is replaced by the chatbot term in the chatdoctor_healthcaremagic dataset. Similar to the literature, the medical_meadow_cord19 dataset is subsampled to 50,000 samples. truthful-qa-* is a benchmark dataset for evaluating the truthfulness of models in text generation, which is used in the Llama 2 paper. Within this dataset, there are 55 and 16 questions related to Health and… See the full description on the dataset page: https://huggingface.co/datasets/lavita/medical-qa-datasets.
- “distribution of instruction lengths”
- “most common values in input”
Healthcare Data
Hugging Face dataset: Nicolybgs/healthcare_data
- “show Available Extra Rooms in Hospital over Stay (in days) as a line chart”
- “average Available Extra Rooms in Hospital by Department”
Chest Xray Classification
Dataset Labels ['NORMAL', 'PNEUMONIA'] Number of Images {'train': 4077, 'test': 582, 'valid': 1165} How to Use Install datasets: pip install datasets Load the dataset: from datasets import load_dataset ds = load_dataset("keremberke/chest-xray-classification", name="full") example = ds['train'][0] Roboflow Dataset Page https://universe.roboflow.com/mohamed-traore-2ekkp/chest-x-rays-qjmia/dataset/2 Citation… See the full description on the dataset page: https://huggingface.co/datasets/keremberke/chest-xray-classification.
- “class distribution across images”
- “sample sizes per category”
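The split sizes quoted in the card make the “sample sizes per category” query easy to sanity-check by hand. A tiny sketch computing split proportions from those published counts:

```python
# Split sizes taken from the dataset card above.
splits = {"train": 4077, "test": 582, "valid": 1165}

total = sum(splits.values())
# Percentage share of each split, rounded to one decimal place.
shares = {name: round(n / total * 100, 1) for name, n in splits.items()}
print(total, shares)
```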
Climate Sentiment
Dataset Card for climate_sentiment Dataset Summary We introduce an expert-annotated dataset for classifying climate-related sentiment of climate-related paragraphs in corporate disclosures. Supported Tasks and Leaderboards The dataset supports a ternary sentiment classification task of whether a given climate-related paragraph has sentiment opportunity, neutral, or risk. Languages The text in the dataset is in English. Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/climatebert/climate_sentiment.
- “most common values in text”
- “sentiment distribution across reviews”
Climate Detection
Dataset Card for climate_detection Dataset Summary We introduce an expert-annotated dataset for detecting climate-related paragraphs in corporate disclosures. Supported Tasks and Leaderboards The dataset supports a binary classification task of whether a given paragraph is climate-related or not. Languages The text in the dataset is in English. Dataset Structure Data Instances { 'text': '− Scope 3: Optional scope that includes… See the full description on the dataset page: https://huggingface.co/datasets/climatebert/climate_detection.
- “most common values in text”
- “class balance of climate-related paragraphs”
Environmental Claims
Dataset Card for environmental_claims Dataset Summary We introduce an expert-annotated dataset for detecting real-world environmental claims made by listed companies. Supported Tasks and Leaderboards The dataset supports a binary classification task of whether a given sentence is an environmental claim or not. Languages The text in the dataset is in English. Dataset Structure Data Instances { "text": "It will enable E.ON to… See the full description on the dataset page: https://huggingface.co/datasets/climatebert/environmental_claims.
- “most common values in text”
- “class balance of environmental claims”
Imdb
Dataset Card for "imdb" Dataset Summary Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Supported Tasks and Leaderboards More Information Needed Languages More Information Needed Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/imdb.
- “most common values in text”
- “sentiment distribution across reviews”
Ag News
Dataset Card for "ag_news" Dataset Summary AG is a collection of more than 1 million news articles. News articles have been gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity. ComeToMyHead is an academic news search engine which has been running since July, 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc), information retrieval (ranking, search, etc), xml… See the full description on the dataset page: https://huggingface.co/datasets/fancyzhx/ag_news.
- “most common values in text”
- “summary charts for the Ag News dataset”
Amazon Polarity
Dataset Card for Amazon Review Polarity Dataset Summary The Amazon reviews dataset consists of reviews from amazon. The data span a period of 18 years, including ~35 million reviews up to March 2013. Reviews include product and user information, ratings, and a plaintext review. Supported Tasks and Leaderboards text-classification, sentiment-classification: The dataset is mainly used for text classification: given the content and the title, predict the correct… See the full description on the dataset page: https://huggingface.co/datasets/fancyzhx/amazon_polarity.
- “most common values in title”
- “sentiment distribution across reviews”
Banking77
Dataset Card for BANKING77 Dataset Summary Deprecated: Dataset "banking77" is deprecated and will be deleted. Use "PolyAI/banking77" instead. Dataset composed of online banking queries annotated with their corresponding intents. BANKING77 dataset provides a very fine-grained set of intents in a banking domain. It comprises 13,083 customer service queries labeled with 77 intents. It focuses on fine-grained single-domain intent detection. Supported Tasks and… See the full description on the dataset page: https://huggingface.co/datasets/legacy-datasets/banking77.
- “most common values in text”
- “default rates by segment”
Glue
Dataset Card for GLUE Dataset Summary GLUE, the General Language Understanding Evaluation benchmark (https://gluebenchmark.com/) is a collection of resources for training, evaluating, and analyzing natural language understanding systems. Supported Tasks and Leaderboards The leaderboard for the GLUE benchmark can be found at this address. It comprises the following tasks: ax A manually-curated evaluation dataset for fine-grained analysis of system… See the full description on the dataset page: https://huggingface.co/datasets/nyu-mll/glue.
- “label distribution across examples”
- “most common values in sentence”
Glue
Dataset Card for GLUE Dataset Summary GLUE, the General Language Understanding Evaluation benchmark (https://gluebenchmark.com/) is a collection of resources for training, evaluating, and analyzing natural language understanding systems. Supported Tasks and Leaderboards The leaderboard for the GLUE benchmark can be found at this address. It comprises the following tasks: ax A manually-curated evaluation dataset for fine-grained analysis of system… See the full description on the dataset page: https://huggingface.co/datasets/nyu-mll/glue.
- “label distribution across examples”
- “most common values in sentence”
Glue
Dataset Card for GLUE Dataset Summary GLUE, the General Language Understanding Evaluation benchmark (https://gluebenchmark.com/) is a collection of resources for training, evaluating, and analyzing natural language understanding systems. Supported Tasks and Leaderboards The leaderboard for the GLUE benchmark can be found at this address. It comprises the following tasks: ax A manually-curated evaluation dataset for fine-grained analysis of system… See the full description on the dataset page: https://huggingface.co/datasets/nyu-mll/glue.
- “label distribution across examples”
- “most common values in sentence1”
Glue
Dataset Card for GLUE Dataset Summary GLUE, the General Language Understanding Evaluation benchmark (https://gluebenchmark.com/) is a collection of resources for training, evaluating, and analyzing natural language understanding systems. Supported Tasks and Leaderboards The leaderboard for the GLUE benchmark can be found at this address. It comprises the following tasks: ax A manually-curated evaluation dataset for fine-grained analysis of system… See the full description on the dataset page: https://huggingface.co/datasets/nyu-mll/glue.
- “label distribution across examples”
- “most common values in question1”
Super Glue
Dataset Card for "super_glue" Dataset Summary SuperGLUE (https://super.gluebenchmark.com/) is a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, improved resources, and a new public leaderboard. Supported Tasks and Leaderboards More Information Needed Languages More Information Needed Dataset Structure Data Instances axb Size of downloaded dataset files: 0.03 MB Size of… See the full description on the dataset page: https://huggingface.co/datasets/aps/super_glue.
- “label distribution across examples”
- “most common values in question”
Squad
Dataset Card for SQuAD Dataset Summary Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. SQuAD 1.1 contains 100,000+ question-answer pairs on 500+ articles. Supported Tasks and Leaderboards Question Answering.… See the full description on the dataset page: https://huggingface.co/datasets/rajpurkar/squad.
- “most common values in title”
- “answer length distribution”
Squad V2
Dataset Card for SQuAD 2.0 Dataset Summary Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. SQuAD 2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers… See the full description on the dataset page: https://huggingface.co/datasets/rajpurkar/squad_v2.
- “most common values in title”
- “answer length distribution”
Wmt16
Dataset Card for "wmt16" Dataset Summary Warning: There are issues with the Common Crawl corpus data (training-parallel-commoncrawl.tgz): Non-English files contain many English sentences. Their "parallel" sentences in English are not aligned: they are uncorrelated with their counterpart. We have contacted the WMT organizers, and in response, they have indicated that they do not have plans to update the Common Crawl corpus data. Their rationale pertains… See the full description on the dataset page: https://huggingface.co/datasets/wmt/wmt16.
- “summary charts for the Wmt16 dataset”
- “top 10 rows of Wmt16 with key statistics”
Conll2003
The shared task of CoNLL-2003 concerns language-independent named entity recognition. We will concentrate on four types of named entities: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups. The CoNLL-2003 shared task data files contain four columns separated by a single space. Each word has been put on a separate line and there is an empty line after each sentence. The first item on each line is a word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag and the fourth the named entity tag. The chunk tags and the named entity tags have the format I-TYPE which means that the word is inside a phrase of type TYPE. Only if two phrases of the same type immediately follow each other, the first word of the sec…
- “distribution of named entity tags”
- “sentence length distribution”
Rotten Tomatoes
Dataset Card for "rotten_tomatoes" Dataset Summary Movie Review Dataset. This is a dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. This data was first used in Bo Pang and Lillian Lee, "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales.", Proceedings of the ACL, 2005. Supported Tasks and Leaderboards More Information Needed Languages… See the full description on the dataset page: https://huggingface.co/datasets/cornell-movie-review-data/rotten_tomatoes.
- “most common values in text”
- “sentiment distribution across reviews”
Yelp Polarity
Dataset Card for "yelp_polarity" Dataset Summary Large Yelp Review Dataset. This is a dataset for binary sentiment classification. We provide a set of 560,000 highly polar yelp reviews for training, and 38,000 for testing. ORIGIN The Yelp reviews dataset consists of reviews from Yelp. It is extracted from the Yelp Dataset Challenge 2015 data. For more information, please refer to http://www.yelp.com/dataset_challenge The Yelp reviews polarity dataset is constructed by… See the full description on the dataset page: https://huggingface.co/datasets/fancyzhx/yelp_polarity.
- “most common values in text”
- “sentiment distribution across reviews”
Trec
The Text REtrieval Conference (TREC) Question Classification dataset contains 5,500 labeled questions in the training set and another 500 in the test set. The dataset has 6 coarse class labels and 50 fine class labels. The average sentence length is 10 words, with a vocabulary size of 8,700. Data are collected from four sources: 4,500 English questions published by USC (Hovy et al., 2001), about 500 manually constructed questions for a few rare classes, 894 TREC 8 and TREC 9 questions, and 500 questions from TREC 10, which serve as the test set. These questions were manually labeled.
- “answer length distribution”
- “most common question types”
Dbpedia 14
Dataset Card for DBpedia14 Dataset Summary The DBpedia ontology classification dataset is constructed by picking 14 non-overlapping classes from DBpedia 2014. They are listed in classes.txt. From each of these 14 ontology classes, we randomly choose 40,000 training samples and 5,000 testing samples. Therefore, the total size of the training dataset is 560,000 and testing dataset 70,000. There are 3 columns in the dataset (same for train and test splits), corresponding to… See the full description on the dataset page: https://huggingface.co/datasets/fancyzhx/dbpedia_14.
- “most common values in title”
- “summary charts for the Dbpedia 14 dataset”
Emotion
Dataset Card for "emotion" Dataset Summary Emotion is a dataset of English Twitter messages with six basic emotions: anger, fear, joy, love, sadness, and surprise. For more detailed information please refer to the paper. Supported Tasks and Leaderboards More Information Needed Languages More Information Needed Dataset Structure Data Instances An example looks as follows. { "text": "im feeling quite sad and sorry for myself but… See the full description on the dataset page: https://huggingface.co/datasets/dair-ai/emotion.
- “most common values in text”
- “post volume over time”
Tweet Eval
Dataset Card for tweet_eval Dataset Summary TweetEval consists of seven heterogeneous tasks in Twitter, all framed as multi-class tweet classification. The tasks include - irony, hate, offensive, stance, emoji, emotion, and sentiment. All tasks have been unified into the same benchmark, with each dataset presented in the same format and with fixed training, validation and test splits. Supported Tasks and Leaderboards text_classification: The dataset can be… See the full description on the dataset page: https://huggingface.co/datasets/cardiffnlp/tweet_eval.
- “most common values in text”
- “sentiment distribution across reviews”
Tweet Eval
Dataset Card for tweet_eval Dataset Summary TweetEval consists of seven heterogeneous tasks in Twitter, all framed as multi-class tweet classification. The tasks include - irony, hate, offensive, stance, emoji, emotion, and sentiment. All tasks have been unified into the same benchmark, with each dataset presented in the same format and with fixed training, validation and test splits. Supported Tasks and Leaderboards text_classification: The dataset can be… See the full description on the dataset page: https://huggingface.co/datasets/cardiffnlp/tweet_eval.
- “most common values in text”
- “sentiment distribution across reviews”
Tweet Eval
Dataset Card for tweet_eval Dataset Summary TweetEval consists of seven heterogeneous tasks in Twitter, all framed as multi-class tweet classification. The tasks include - irony, hate, offensive, stance, emoji, emotion, and sentiment. All tasks have been unified into the same benchmark, with each dataset presented in the same format and with fixed training, validation and test splits. Supported Tasks and Leaderboards text_classification: The dataset can be… See the full description on the dataset page: https://huggingface.co/datasets/cardiffnlp/tweet_eval.
- “most common values in text”
- “sentiment distribution across reviews”
Go Emotions
Dataset Card for GoEmotions Dataset Summary The GoEmotions dataset contains 58k carefully curated Reddit comments labeled for 27 emotion categories or Neutral. The raw data is included as well as the smaller, simplified version of the dataset with predefined train/val/test splits. Supported Tasks and Leaderboards This dataset is intended for multi-class, multi-label emotion classification. Languages The data is in English. Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/google-research-datasets/go_emotions.
- “most common values in text”
- “summary charts for the Go Emotions dataset”
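GoEmotions is multi-label, so each comment carries a *list* of label ids that must be flattened before counting. A sketch on invented rows (the id-to-name mapping here is illustrative, not the dataset's real one):

```python
from collections import Counter

# Hypothetical multi-label rows; ids and names are illustrative only.
rows = [
    {"text": "thanks, that made my day!", "labels": [0, 1]},
    {"text": "ugh, not again", "labels": [2]},
    {"text": "wow, congrats!", "labels": [1]},
]

label_names = {0: "gratitude", 1: "joy", 2: "annoyance"}

# Flatten the per-comment label lists before counting.
counts = Counter(label_names[i] for r in rows for i in r["labels"])
print(counts.most_common())  # [('joy', 2), ('gratitude', 1), ('annoyance', 1)]
```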
Snli
Dataset Card for SNLI Dataset Summary The SNLI corpus (version 1.0) is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of natural language inference (NLI), also known as recognizing textual entailment (RTE). Supported Tasks and Leaderboards Natural Language Inference (NLI), also known as Recognizing Textual Entailment (RTE), is the… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/snli.
- “most common values in premise”
- “summary charts for the Snli dataset”
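A small wrinkle when aggregating SNLI: examples with no gold label are marked -1 and are conventionally filtered out. A sketch on invented rows (the 0/1/2 mapping to entailment/neutral/contradiction is the commonly documented order; verify against the ClassLabel feature):

```python
from collections import Counter

# Hypothetical sentence pairs shaped like SNLI rows; label -1 marks
# examples without a gold label.
rows = [
    {"premise": "A man plays guitar.", "hypothesis": "A man makes music.", "label": 0},
    {"premise": "A man plays guitar.", "hypothesis": "A man sleeps.", "label": 2},
    {"premise": "A dog runs.", "hypothesis": "An animal moves.", "label": 0},
    {"premise": "Kids wave.", "hypothesis": "Kids are outside.", "label": -1},
]

names = {0: "entailment", 1: "neutral", 2: "contradiction"}
counts = Counter(names[r["label"]] for r in rows if r["label"] != -1)
print(counts.most_common())  # [('entailment', 2), ('contradiction', 1)]
```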
Multi Nli
Dataset Card for Multi-Genre Natural Language Inference (MultiNLI) Dataset Summary The Multi-Genre Natural Language Inference (MultiNLI) corpus is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information. The corpus is modeled on the SNLI corpus, but differs in that it covers a range of genres of spoken and written text, and supports a distinctive cross-genre generalization evaluation. The corpus served as the basis for the shared task… See the full description on the dataset page: https://huggingface.co/datasets/nyu-mll/multi_nli.
- “label distribution by genre”
- “count of sentence pairs per genre”
Hellaswag
Dataset Card for "hellaswag" Dataset Summary HellaSwag: Can a Machine Really Finish Your Sentence? is a new dataset for commonsense NLI. A paper was published at ACL2019. Supported Tasks and Leaderboards More Information Needed Languages More Information Needed Dataset Structure Data Instances default Size of downloaded dataset files: 71.49 MB Size of the generated dataset: 65.32 MB Total amount of disk used: 136.81… See the full description on the dataset page: https://huggingface.co/datasets/Rowan/hellaswag.
- “most common values in activity_label”
- “distribution of activity_label”
Piqa
To apply eyeshadow without a brush, should I use a cotton swab or a toothpick? Questions requiring this kind of physical commonsense pose a challenge to state-of-the-art natural language understanding systems. The PIQA dataset introduces the task of physical commonsense reasoning and a corresponding benchmark dataset Physical Interaction: Question Answering or PIQA. Physical commonsense knowledge is a major challenge on the road to true AI-completeness, including robots that interact with the world and understand natural language. PIQA focuses on everyday situations with a preference for atypical solutions. The dataset is inspired by instructables.com, which provides users with instructions on how to build, craft, bake, or manipulate objects using everyday materials. The underlying task…
- “answer length distribution”
- “most common question types”
Winogrande
Dataset Card for "winogrande" Dataset Summary WinoGrande is a new collection of 44k problems, inspired by Winograd Schema Challenge (Levesque, Davis, and Morgenstern 2011), but adjusted to improve the scale and robustness against the dataset-specific bias. Formulated as a fill-in-a-blank task with binary options, the goal is to choose the right option for a given sentence which requires commonsense reasoning. Supported Tasks and Leaderboards More Information… See the full description on the dataset page: https://huggingface.co/datasets/allenai/winogrande.
- “distribution of answer”
- “most common values in sentence”
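The fill-in-the-blank format is easy to materialize: substitute each option into the "_" placeholder. A sketch using an invented row with the field names from the dataset card (the answer being stored as the string "1" or "2" is an assumption worth checking):

```python
# Invented WinoGrande-style row; field names follow the dataset card.
row = {
    "sentence": "The trophy doesn't fit in the suitcase because the _ is too big.",
    "option1": "trophy",
    "option2": "suitcase",
    "answer": "1",  # assumed to be the string "1" or "2"
}

# Build both candidate sentences by filling the blank with each option.
candidates = [row["sentence"].replace("_", opt) for opt in (row["option1"], row["option2"])]
correct = candidates[int(row["answer"]) - 1]
print(correct)
```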
Openai Humaneval
Dataset Card for OpenAI HumanEval Dataset Summary The HumanEval dataset released by OpenAI includes 164 programming problems with a function signature, docstring, body, and several unit tests. They were handwritten to ensure they were not included in the training set of code generation models. Supported Tasks and Leaderboards Languages The programming problems are written in Python and contain English natural text in comments and docstrings.… See the full description on the dataset page: https://huggingface.co/datasets/openai/openai_humaneval.
- “prompt length distribution”
- “canonical solution length distribution”
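HumanEval's fields lend themselves to a simple pass/fail harness: concatenate the prompt with a candidate completion, then run the provided check function. A toy sketch (the problem below is invented; only the field names `prompt`, `test`, and `entry_point` follow the dataset card):

```python
# Invented toy problem in the HumanEval record shape.
problem = {
    "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
    "test": "def check(candidate):\n    assert candidate(2, 3) == 5\n",
    "entry_point": "add",
}
completion = "    return a + b\n"

def passes(problem, completion):
    env = {}
    exec(problem["prompt"] + completion, env)  # define the candidate function
    exec(problem["test"], env)                 # define check()
    try:
        env["check"](env[problem["entry_point"]])
        return True
    except AssertionError:
        return False

print(passes(problem, completion))  # True
```

In practice, untrusted model completions should run in a sandboxed subprocess with a timeout, not a bare `exec`.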
Mbpp
Dataset Card for Mostly Basic Python Problems (mbpp) Dataset Summary The benchmark consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, code solution and 3 automated test cases. As described in the paper, a subset of the data has been hand-verified by us. Released here as part of… See the full description on the dataset page: https://huggingface.co/datasets/google-research-datasets/mbpp.
- “code solution length distribution”
- “number of test cases per problem”
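MBPP's `test_list` holds plain assert strings, so checking a record is just running the code and then each assertion. A sketch on an invented record (field names follow the dataset card):

```python
# Invented record in the MBPP shape.
record = {
    "text": "Write a function to square a number.",
    "code": "def square(n):\n    return n * n\n",
    "test_list": ["assert square(2) == 4", "assert square(-3) == 9"],
}

env = {}
exec(record["code"], env)   # define the reference solution
for t in record["test_list"]:
    exec(t, env)            # each test raises AssertionError on failure
print("all", len(record["test_list"]), "tests passed")
```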
Lambada
Dataset Card for LAMBADA Dataset Summary The LAMBADA dataset evaluates the capabilities of computational models for text understanding by means of a word prediction task. LAMBADA is a collection of narrative passages sharing the characteristic that human subjects are able to guess their last word if they are exposed to the whole passage, but not if they only see the last sentence preceding the target word. To succeed on LAMBADA, computational models cannot simply rely on local… See the full description on the dataset page: https://huggingface.co/datasets/cimec/lambada.
- “distribution of domain”
- “most common values in text”
Mnist
Dataset Card for MNIST Dataset Summary The MNIST dataset consists of 70,000 28x28 black-and-white images of handwritten digits extracted from two NIST databases. There are 60,000 images in the training dataset and 10,000 images in the validation dataset, one class per digit so a total of 10 classes, with 7,000 images (6,000 train images and 1,000 test images) per class. Half of the images were drawn by Census Bureau employees and the other half by high school students… See the full description on the dataset page: https://huggingface.co/datasets/ylecun/mnist.
- “distribution of label”
- “class balance between train and test splits”
Cifar10
Dataset Card for CIFAR-10 Dataset Summary The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain… See the full description on the dataset page: https://huggingface.co/datasets/uoft-cs/cifar10.
- “distribution of label”
- “image counts per class”
Cifar100
Dataset Card for CIFAR-100 Dataset Summary The CIFAR-100 dataset consists of 60000 32x32 colour images in 100 classes, with 600 images per class. There are 500 training images and 100 testing images per class. There are 50000 training images and 10000 test images. The 100 classes are grouped into 20 superclasses. There are two labels per image - fine label (actual class) and coarse label (superclass). Supported Tasks and Leaderboards image-classification: The… See the full description on the dataset page: https://huggingface.co/datasets/uoft-cs/cifar100.
- “image counts per coarse_label”
- “fine label counts within each superclass”
Fashion Mnist
Dataset Card for FashionMNIST Dataset Summary Fashion-MNIST is a dataset of Zalando's article images—consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing… See the full description on the dataset page: https://huggingface.co/datasets/zalando-datasets/fashion_mnist.
- “distribution of label”
- “image counts per clothing class”
Food101
Dataset Card for Food-101 Dataset Summary This dataset consists of 101 food categories, with 101'000 images. For each class, 250 manually reviewed test images are provided as well as 750 training images. On purpose, the training images were not cleaned, and thus still contain some amount of noise. This comes mostly in the form of intense colors and sometimes wrong labels. All images were rescaled to have a maximum side length of 512 pixels. Supported Tasks and… See the full description on the dataset page: https://huggingface.co/datasets/ethz/food101.
- “image counts per food category”
- “distribution of labels across splits”
Speech Commands
This is a set of one-second .wav audio files, each containing a single spoken English word or background noise. These words are from a small set of commands, and are spoken by a variety of different speakers. This data set is designed to help train simple machine learning models. This dataset is covered in more detail at [https://arxiv.org/abs/1804.03209](https://arxiv.org/abs/1804.03209). Version 0.01 of the data set (configuration `"v0.01"`) was released on August 3rd 2017 and contains 64,727 audio files. In version 0.01 thirty different words were recorded: "Yes", "No", "Up", "Down", "Left", "Right", "On", "Off", "Stop", "Go", "Zero", "One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine", "Bed", "Bird", "Cat", "Dog", "Happy", "House", "Marvin", "Sheila", "Tree", "Wow".…
- “number of clips per spoken word”
- “distribution of labels”
Common Voice
Common Voice is Mozilla's initiative to help teach machines how real people speak. The dataset currently consists of 7,335 validated hours of speech in 60 languages, but we’re always adding more voices and languages.
- “clip length distribution”
- “speaker counts”
Arxiv Dataset
A dataset of 1.7 million arXiv articles for applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction and semantic search interfaces.
- “summary charts for the Arxiv Dataset dataset”
- “top 10 rows of Arxiv Dataset with key statistics”
Arxiv Classification
Arxiv Classification: a classification of Arxiv Papers (11 classes). This dataset is intended for long context classification (documents have all > 4k tokens). Copied from "Long Document Classification From Local Word Glimpses via Recurrent Attention Learning" @ARTICLE{8675939, author={He, Jun and Wang, Liqun and Liu, Liu and Feng, Jiao and Wu, Hao}, journal={IEEE Access}, title={Long Document Classification From Local Word Glimpses via Recurrent Attention Learning}, year={2019}… See the full description on the dataset page: https://huggingface.co/datasets/ccdv/arxiv-classification.
- “most common values in text”
- “summary charts for the Arxiv Classification dataset”
Arxiver
Arxiver Dataset Arxiver consists of 63,357 arXiv papers converted to multi-markdown (.mmd) format. Our dataset includes original arXiv article IDs, titles, abstracts, authors, publication dates, URLs and corresponding markdown files published between January 2023 and October 2023. We hope our dataset will be useful for various applications such as semantic search, domain specific language modeling, question answering and summarization. Curation The Arxiver dataset is… See the full description on the dataset page: https://huggingface.co/datasets/neuralwork/arxiver.
- “papers published per month”
- “abstract length distribution”
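Since new-style arXiv ids encode year and month as YYMM before the dot, per-month trend prompts reduce to parsing the id. A sketch on invented ids:

```python
from collections import Counter

# Invented new-style arXiv ids ("YYMM.number").
ids = ["2301.00001", "2301.04567", "2307.01234", "2310.11111"]

def year_month(arxiv_id):
    """Split a new-style arXiv id into (year, month) strings."""
    yymm = arxiv_id.split(".")[0]
    return "20" + yymm[:2], yymm[2:]

counts = Counter(year_month(i) for i in ids)
print(counts[("2023", "01")])  # 2
```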
Amazon Reviews Multi
We provide an Amazon product reviews dataset for multilingual text classification. The dataset contains reviews in English, Japanese, German, French, Chinese and Spanish, collected between November 1, 2015 and November 1, 2019. Each record in the dataset contains the review text, the review title, the star rating, an anonymized reviewer ID, an anonymized product ID and the coarse-grained product category (e.g. ‘books’, ‘appliances’, etc.) The corpus is balanced across stars, so each star rating constitutes 20% of the reviews in each language. For each language, there are 200,000, 5,000 and 5,000 reviews in the training, development and test sets respectively. The maximum number of reviews per reviewer is 20 and the maximum number of reviews per product is 20. All reviews are truncated aft…
- “sentiment distribution across reviews”
- “average rating by category”
Yelp Review Full
Dataset Card for YelpReviewFull Dataset Summary The Yelp reviews dataset consists of reviews from Yelp. It is extracted from the Yelp Dataset Challenge 2015 data. Supported Tasks and Leaderboards text-classification, sentiment-classification: The dataset is mainly used for text classification: given the text, predict the sentiment. Languages The reviews were mainly written in English. Dataset Structure Data Instances A… See the full description on the dataset page: https://huggingface.co/datasets/Yelp/yelp_review_full.
- “most common values in text”
- “sentiment distribution across reviews”
Poem Sentiment
Dataset Card for Gutenberg Poem Dataset Dataset Summary Poem Sentiment is a sentiment dataset of poem verses from Project Gutenberg. This dataset can be used for tasks such as sentiment classification or style transfer for poems. Supported Tasks and Leaderboards [More Information Needed] Languages The text in the dataset is in English (en). Dataset Structure Data Instances Example of one instance in the dataset. {'id': 0… See the full description on the dataset page: https://huggingface.co/datasets/google-research-datasets/poem_sentiment.
- “distribution of label”
- “most common values in verse_text”
Cnn Dailymail
Dataset Card for CNN Dailymail Dataset Dataset Summary The CNN / DailyMail Dataset is an English-language dataset containing just over 300k unique news articles as written by journalists at CNN and the Daily Mail. The current version supports both extractive and abstractive summarization, though the original version was created for machine reading and comprehension and abstractive question answering. Supported Tasks and Leaderboards 'summarization': Versions… See the full description on the dataset page: https://huggingface.co/datasets/abisee/cnn_dailymail.
- “most common values in article”
- “answer length distribution”
Xsum
Dataset Card for "xsum" Dataset Summary Extreme Summarization (XSum) Dataset. There are three features: document: Input news article. summary: One sentence summary of the article. id: BBC ID of the article. Supported Tasks and Leaderboards More Information Needed Languages More Information Needed Dataset Structure Data Instances default Size of downloaded dataset files: 257.30 MB Size of the generated dataset:… See the full description on the dataset page: https://huggingface.co/datasets/EdinburghNLP/xsum.
- “most common values in document”
- “summary charts for the Xsum dataset”
Newsroom
NEWSROOM is a large dataset for training and evaluating summarization systems. It contains 1.3 million articles and summaries written by authors and editors in the newsrooms of 38 major publications. Dataset features include: - text: Input news text. - summary: Summary for the news. And additional features: - title: news title. - url: url of the news. - date: date of the article. - density: extractive density. - coverage: extractive coverage. - compression: compression ratio. - density_bin: low, medium, high. - coverage_bin: extractive, abstractive. - compression_bin: low, medium, high. This dataset can be downloaded upon request. Unzip all the contents "train.jsonl, dev.jsonl, test.jsonl" to the tfds folder.
- “summary charts for the Newsroom dataset”
- “top 10 rows of Newsroom with key statistics”
Multi News
Multi-News consists of news articles and human-written summaries of these articles from the site newser.com. Each summary is professionally written by editors and includes links to the original articles cited. There are two features: - document: text of news articles separated by the special token "|||||". - summary: news summary.
- “summary charts for the Multi News dataset”
- “top 10 rows of Multi News with key statistics”
Lex Glue
Dataset Card for "LexGLUE" Dataset Summary Inspired by the recent widespread use of the GLUE multi-task benchmark NLP dataset (Wang et al., 2018), the subsequent more difficult SuperGLUE (Wang et al., 2019), other previous multi-task NLP benchmarks (Conneau and Kiela, 2018; McCann et al., 2018), and similar initiatives in other domains (Peng et al., 2019), we introduce the Legal General Language Understanding Evaluation (LexGLUE) benchmark, a benchmark dataset to evaluate… See the full description on the dataset page: https://huggingface.co/datasets/coastalcph/lex_glue.
- “summary charts for the Lex Glue dataset”
- “top 10 rows of Lex Glue with key statistics”
Billsum
Dataset Card for "billsum" Dataset Summary BillSum, summarization of US Congressional and California state bills. There are several features: text: bill text. summary: summary of the bills. title: title of the bills. The following features exist for US bills only (CA bills do not have them): text_len: number of chars in text. sum_len: number of chars in summary. Supported Tasks and Leaderboards More Information Needed Languages More Information Needed Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/FiscalNote/billsum.
- “most common values in text”
- “summary charts for the Billsum dataset”
Codeparrot Clean
CodeParrot 🦜 Dataset Cleaned What is it? A dataset of Python files from GitHub. This is the deduplicated version of the CodeParrot dataset. Processing The original dataset contains a lot of duplicated and noisy data. Therefore, the dataset was cleaned with the following steps: Deduplication Remove exact matches Filtering Average line length < 100 Maximum line length < 1000 Alpha numeric characters fraction > 0.25 Remove auto-generated files (keyword search) For… See the full description on the dataset page: https://huggingface.co/datasets/codeparrot/codeparrot-clean.
- “distribution of license”
- “scatter line_mean vs line_max”
Conala
CoNaLa is a dataset of code and natural language pairs crawled from Stack Overflow, for more details please refer to this paper: https://arxiv.org/pdf/1805.08949.pdf or the dataset page https://conala-corpus.github.io/.
- “snippet length distribution”
- “most common intents”
Synthetic Text To Sql
Image generated by DALL-E. See prompt for more details synthetic_text_to_sql gretelai/synthetic_text_to_sql is a rich dataset of high quality synthetic Text-to-SQL samples, designed and generated using Gretel Navigator, and released under Apache 2.0. Please see our release blogpost for more details. The dataset includes: 105,851 records partitioned into 100,000 train and 5,851 test records ~23M total tokens, including ~12M SQL tokens Coverage across 100 distinct… See the full description on the dataset page: https://huggingface.co/datasets/gretelai/synthetic_text_to_sql.
- “record counts by domain”
- “distribution of sql_complexity”
Sql Create Context
Overview This dataset builds from WikiSQL and Spider. There are 78,577 examples of natural language queries, SQL CREATE TABLE statements, and SQL Query answering the question using the CREATE statement as context. This dataset was built with text-to-sql LLMs in mind, intending to prevent hallucination of column and table names often seen when trained on text-to-sql datasets. The CREATE TABLE statement can often be copy and pasted from different DBMS and provides table names, column… See the full description on the dataset page: https://huggingface.co/datasets/b-mc2/sql-create-context.
- “most common values in answer”
- “answer length distribution”
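A record of this shape can be sanity-checked end to end with SQLite: run the CREATE TABLE context, then the answer query, against an in-memory database. A sketch on an invented record (the field names `context`/`question`/`answer` are assumptions about the schema):

```python
import sqlite3

# Invented record in the sql-create-context style.
record = {
    "context": "CREATE TABLE head (age INTEGER)",
    "question": "How many heads of the departments are older than 56?",
    "answer": "SELECT COUNT(*) FROM head WHERE age > 56",
}

conn = sqlite3.connect(":memory:")
conn.execute(record["context"])                      # build the schema
result = conn.execute(record["answer"]).fetchall()   # run the answer query
print(result)  # [(0,)] -- the table is empty, but the query parses and runs
```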
Wikitext
Dataset Card for "wikitext" Dataset Summary The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License. Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger… See the full description on the dataset page: https://huggingface.co/datasets/Salesforce/wikitext.
- “most common values in text”
- “text length distribution”
Oasst1 Pairwise Rlhf Reward
Dataset Card for "oasst1_pairwise_rlhf_reward" OASST1 dataset preprocessed for reward modeling: import pandas as pd from datasets import load_dataset,concatenate_datasets, Dataset, DatasetDict import numpy as np dataset = load_dataset("OpenAssistant/oasst1") df=concatenate_datasets(list(dataset.values())).to_pandas() m2t=df.set_index("message_id")['text'].to_dict() m2r=df.set_index("message_id")['role'].to_dict() m2p=df.set_index('message_id')['parent_id'].to_dict()… See the full description on the dataset page: https://huggingface.co/datasets/tasksource/oasst1_pairwise_rlhf_reward.
- “distribution of lang”
- “most common values in parent_id”
Credit Card Clients
Default of Credit Card Clients Dataset The following was retrieved from the UCI machine learning repository. Dataset Information This dataset contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005. Content There are 25 variables: ID: ID of each client LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit) SEX:… See the full description on the dataset page: https://huggingface.co/datasets/scikit-learn/credit-card-clients.
- “default rate by EDUCATION”
- “scatter LIMIT_BAL vs AGE”
Auto Mpg
Auto Miles per Gallon (MPG) Dataset Following description was taken from UCI machine learning repository. Source: This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. The dataset was used in the 1983 American Statistical Association Exposition. Data Set Information: This dataset is a slightly modified version of the dataset provided in the StatLib library. In line with the use by Ross Quinlan (1993) in predicting the attribute… See the full description on the dataset page: https://huggingface.co/datasets/scikit-learn/auto-mpg.
- “show mpg over model year as a line chart”
- “scatter mpg vs cylinders”
Awesome Chatgpt Prompts
a.k.a. Awesome ChatGPT Prompts This is a Dataset Repository mirror of prompts.chat — a social platform for AI prompts. 📢 Notice This Hugging Face dataset is a mirror. For the latest prompts, features, and community contributions, please visit: 🌐 Website: prompts.chat 📦 GitHub: github.com/f/awesome-chatgpt-prompts About prompts.chat is an open-source platform where users can share, discover, and collect AI prompts from the community. The project can be… See the full description on the dataset page: https://huggingface.co/datasets/fka/prompts.chat.
- “distribution of type”
- “most common values in act”
Dialogstudio
DialogStudio: Unified Dialog Datasets and Instruction-Aware Models for Conversational AI Author: Jianguo Zhang, Kun Qian Paper|Github|[GDrive] 🎉 March 18, 2024: Update for AI Agent. Check xLAM for the latest data and models relevant to AI Agent! 🎉 March 10 2024: Update for dataset viewer issues: Please refer to https://github.com/salesforce/DialogStudio for view of each dataset, where we provide 5 converted examples along with 5 original examples under each data folder. For… See the full description on the dataset page: https://huggingface.co/datasets/Salesforce/dialogstudio.
- “dialogue counts per source dataset”
- “distribution of dialogue lengths”
Mathinstruct
🦣 MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning MathInstruct is a meticulously curated instruction tuning dataset that is lightweight yet generalizable. MathInstruct is compiled from 13 math rationale datasets, six of which are newly curated by this work. It uniquely focuses on the hybrid use of chain-of-thought (CoT) and program-of-thought (PoT) rationales, and ensures extensive coverage of diverse mathematical fields. Project Page:… See the full description on the dataset page: https://huggingface.co/datasets/TIGER-Lab/MathInstruct.
- “most common values in source”
- “summary charts for the Mathinstruct dataset”
Math Qa
Our dataset is gathered by using a new representation language to annotate the AQuA-RAT dataset. AQuA-RAT provides the questions, options, rationales, and correct options.
- “answer length distribution”
- “most common question types”
Competition Math
The Mathematics Aptitude Test of Heuristics (MATH) dataset consists of problems from mathematics competitions, including the AMC 10, AMC 12, AIME, and more. Each problem in MATH has a full step-by-step solution, which can be used to teach models to generate answer derivations and explanations.
- “answer length distribution”
- “most common question types”
Gsm8K
Dataset Card for GSM8K Dataset Summary GSM8K (Grade School Math 8K) is a dataset of 8.5K high quality linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning. These problems take between 2 and 8 steps to solve. Solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − ×÷) to reach the… See the full description on the dataset page: https://huggingface.co/datasets/openai/gsm8k.
- “most common values in question”
- “answer length distribution”
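GSM8K solutions conventionally end with the final numeric answer after a "####" marker, so extracting it is a one-line regex. A sketch (the solution text below is invented in that style):

```python
import re

# Invented GSM8K-style solution text ending in "#### <answer>".
answer_text = (
    "Natalia sold 48 clips in April and half as many in May. "
    "48 / 2 = 24. 48 + 24 = 72. #### 72"
)

# Grab the number after "####", dropping thousands separators.
match = re.search(r"####\s*([\-0-9,\.]+)", answer_text)
final = match.group(1).replace(",", "")
print(final)  # 72
```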
Natural Questions
Dataset Card for Natural Questions Dataset Summary The NQ corpus contains questions from real users, and it requires QA systems to read and comprehend an entire Wikipedia article that may or may not contain the answer to the question. The inclusion of real user questions, and the requirement that solutions should read an entire page to find the answer, cause NQ to be a more realistic and challenging task than prior QA datasets. Supported Tasks and Leaderboards… See the full description on the dataset page: https://huggingface.co/datasets/google-research-datasets/natural_questions.
- “question length distribution”
- “answer length distribution”
Hotpot Qa
Dataset Card for "hotpot_qa" Dataset Summary HotpotQA is a new dataset with 113k Wikipedia-based question-answer pairs with four key features: (1) the questions require finding and reasoning over multiple supporting documents to answer; (2) the questions are diverse and not constrained to any pre-existing knowledge bases or knowledge schemas; (3) we provide sentence-level supporting facts required for reasoning, allowing QA systems to reason… See the full description on the dataset page: https://huggingface.co/datasets/hotpotqa/hotpot_qa.
- “distribution of type”
- “distribution of level”
Trivia Qa
Dataset Card for "trivia_qa" Dataset Summary TriviaQA is a reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high quality distant supervision for answering the questions. Supported Tasks and Leaderboards More Information Needed Languages English.… See the full description on the dataset page: https://huggingface.co/datasets/mandarjoshi/trivia_qa.
- “distribution of question_source”
- “most common values in question”
Yeast
Yeast The Yeast dataset from the UCI repository. Usage from datasets import load_dataset dataset = load_dataset("mstz/yeast")["train"] Configurations and tasks Configuration Task Description yeast Multiclass classification. yeast_0 Binary classification. Is the instance of class 0? yeast_1 Binary classification. Is the instance of class 1? yeast_2 Binary classification. Is the instance of class 2? yeast_3 Binary classification. Is the… See the full description on the dataset page: https://huggingface.co/datasets/mstz/yeast.
- “summary charts for the Yeast dataset”
- “top 10 rows of Yeast with key statistics”
Letter
Letter The Letter dataset from the UCI repository. Letter recognition. Configurations and tasks Configuration Task Description letter Multiclass classification. A Binary classification. Is this letter A? B Binary classification. Is this letter B? C Binary classification. Is this letter C? ... Binary classification. ...
- “summary charts for the Letter dataset”
- “top 10 rows of Letter with key statistics”
Spambase
Spambase The Spambase dataset from the UCI ML repository. Is the given mail spam? Configurations and tasks Configuration Task Description spambase Binary classification Is the mail spam? Usage from datasets import load_dataset dataset = load_dataset("mstz/spambase")["train"]
- “scatter word_freq_make vs word_freq_address”
- “correlation heatmap of all numeric columns”
Magic
Magic The MAGIC Gamma Telescope dataset from the UCI ML repository. Configurations and tasks Configuration Task Description magic Binary classification Classify each telescope event as gamma (signal) or hadron (background). Usage from datasets import load_dataset dataset = load_dataset("mstz/magic")["train"]
- “scatter major_axis_length vs minor_axis_length”
- “correlation heatmap of all numeric columns”
Sonar
Sonar The Sonar dataset from the UCI ML repository. Dataset to discriminate between sonar signals bounced off a metal cylinder and those bounced off a roughly cylindrical rock. Configurations and tasks Configuration Task Description sonar Binary classification Is the sonar detecting a rock? Usage from datasets import load_dataset dataset = load_dataset("mstz/sonar")["train"]
- “scatter 0 vs 1”
- “correlation heatmap of all numeric columns”
Chess
Chess Rook VS Pawn The Chess King-Rook vs King-Pawn dataset from the UCI ML repository. Configurations and tasks Configuration Task Description chess Binary classification Can the white side win? Usage from datasets import load_dataset dataset = load_dataset("mstz/chess_rock_vs_pawn")["train"]
- “summary charts for the Chess dataset”
- “top 10 rows of Chess with key statistics”
Nursery
Nursery The Nursery dataset from the UCI repository. Should the nursery school accept the student application? Configurations and tasks Configuration Task nursery Multiclass classification nursery_binary Binary classification
- “summary charts for the Nursery dataset”
- “top 10 rows of Nursery with key statistics”
Monks
Monks The Monk dataset from UCI. Configurations and tasks Configuration Task monks1 Binary classification monks2 Binary classification monks3 Binary classification Usage from datasets import load_dataset dataset = load_dataset("mstz/monks", "monks1")["train"]
- “summary charts for the Monks dataset”
- “top 10 rows of Monks with key statistics”
Ionosphere
Ionosphere The Ionosphere dataset from the UCI ML repository. Radar dataset of signals returned from the ionosphere; the task is to distinguish "good" returns, which show ionospheric structure, from "bad" ones. Configurations and tasks Configuration Task Description ionosphere Binary classification Does the received signal indicate electrons in the ionosphere? Usage from datasets import load_dataset dataset = load_dataset("mstz/ionosphere")["train"]
- “scatter signal_0 vs signal_1”
- “correlation heatmap of all numeric columns”
Debatesum
DebateSum Corresponding code repo for the upcoming paper at ARGMIN 2020: "DebateSum: A large-scale argument mining and summarization dataset" Arxiv pre-print available here: https://arxiv.org/abs/2011.07251 Check out the presentation date and time here: https://argmining2020.i3s.unice.fr/node/9 Full paper as presented by the ACL is here: https://www.aclweb.org/anthology/2020.argmining-1.1/ Video of presentation at COLING 2020:… See the full description on the dataset page: https://huggingface.co/datasets/Hellisotherpeople/DebateSum.
- “show Unnamed: 0 over Year as a line chart”
- “average Unnamed: 0 by OriginalDebateFileName”
Legalbench
Dataset Card for Dataset Name Homepage: https://hazyresearch.stanford.edu/legalbench/ Repository: https://github.com/HazyResearch/legalbench/ Paper: https://arxiv.org/abs/2308.11462 Dataset Description Dataset Summary The LegalBench project is an ongoing open science effort to collaboratively curate tasks for evaluating legal reasoning in English large language models (LLMs). The benchmark currently consists of 162 tasks gathered from 40… See the full description on the dataset page: https://huggingface.co/datasets/nguha/legalbench.
- “most common values in answer”
- “commits per language”
Social I Qa
We introduce Social IQa: Social Interaction QA, a new question-answering benchmark for testing social commonsense intelligence. Contrary to many prior benchmarks that focus on physical or taxonomic knowledge, Social IQa focuses on reasoning about people’s actions and their social implications. For example, given an action like "Jesse saw a concert" and a question like "Why did Jesse do this?", humans can easily infer that Jesse wanted "to see their favorite performer" or "to enjoy the music", and not "to see what's happening inside" or "to see if it works". The actions in Social IQa span a wide variety of social situations, and answer candidates contain both human-curated answers and adversarially-filtered machine-generated candidates. Social IQa contains over 37,000 QA pairs for evaluatin…
- “answer length distribution”
- “most common question types”
Sms Spam
Dataset Card for SMS Spam Dataset Summary The SMS Spam Collection v.1 is a public set of labeled SMS messages collected for mobile phone spam research. It has one collection composed of 5,574 real, non-encoded English messages, tagged as legitimate (ham) or spam. Supported Tasks and Leaderboards [More Information Needed] Languages English Dataset Structure Data Instances [More Information… See the full description on the dataset page: https://huggingface.co/datasets/ucirvine/sms_spam.
- “most common values in sms”
- “summary charts for the Sms Spam dataset”
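The "most common values" queries reduce to a frequency count over a column. A hedged sketch with invented toy messages (not real rows from the collection):

```python
import pandas as pd

# Toy stand-ins for SMS Spam rows (invented messages, not real corpus rows).
rows = [
    ("Free entry! Text WIN to claim", "spam"),
    ("See you at lunch?", "ham"),
    ("Running late, sorry", "ham"),
]
df = pd.DataFrame(rows, columns=["sms", "label"])

# Frequency count per label; the real collection's 5,574 messages skew heavily toward ham.
print(df["label"].value_counts())
```

The same `value_counts()` call over the `sms` column answers "most common values in sms" directly.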
Wiki Bio
This dataset gathers 728,321 biographies from Wikipedia. It aims at evaluating text generation algorithms. For each article, we provide the first paragraph and the infobox (both tokenized): the first paragraph as text and the infobox as structured data. Each infobox is encoded as a list of (field name, field value) pairs. We used Stanford CoreNLP (http://stanfordnlp.github.io/CoreNLP/) to preprocess the data, i.e. we broke the text into sentences and tokenized both the text and the field values. The dataset was randomly split into three subsets: train (80%), valid (10%), test (10%).
- “commits per language”
- “distribution of file sizes”
Wiki Hop
WikiHop is open-domain and based on Wikipedia articles; the goal is to recover Wikidata information by hopping through documents, answering text-understanding queries by combining multiple facts spread across different documents.
- “answer length distribution”
- “most common question types”
Wiki Qa
Dataset Card for "wiki_qa" Dataset Summary Wiki Question Answering corpus from Microsoft. The WikiQA corpus is a publicly available set of question and sentence pairs, collected and annotated for research on open-domain question answering. Supported Tasks and Leaderboards More Information Needed Languages More Information Needed Dataset Structure Data Instances default Size of downloaded dataset files: 7.10 MB Size… See the full description on the dataset page: https://huggingface.co/datasets/microsoft/wiki_qa.
- “average label by question_id”
- “distribution of question_id”
Qasper
A dataset containing 1585 papers with 5049 information-seeking questions asked by regular readers of NLP papers, and answered by a separate set of NLP practitioners.
- “answer length distribution”
- “most common question types”
Narrativeqa
Dataset Card for Narrative QA Dataset Summary NarrativeQA is an English-language dataset of stories and corresponding questions designed to test reading comprehension, especially on long documents. Supported Tasks and Leaderboards The dataset is used to test reading comprehension. There are 2 tasks proposed in the paper: "summaries only" and "stories only", depending on whether the human-generated summary or the full story text is used to answer the question.… See the full description on the dataset page: https://huggingface.co/datasets/deepmind/narrativeqa.
- “answer length distribution”
- “most common question types”
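Several QA entries above share the example query "answer length distribution". One way to sketch it, assuming answers arrive as plain strings (the toy answers below are illustrative, not actual NarrativeQA rows):

```python
from collections import Counter

# Toy answer strings standing in for a QA dataset's answer column.
answers = [
    "to see their favorite performer",
    "yes",
    "because the bridge was closed",
]

# Bucket answers by word count to get a length distribution.
lengths = Counter(len(a.split()) for a in answers)
print(sorted(lengths.items()))  # → [(1, 1), (5, 2)]
```

A bar chart over these (length, count) pairs is the distribution the catalog query renders.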
Eli5
Explain Like I'm 5 long form QA dataset
- “answer length distribution”
- “most common question types”
Reddit
This corpus contains preprocessed posts from the Reddit dataset. The dataset consists of 3,848,330 posts with an average length of 270 words for content and 28 words for the summary. Features include the strings: author, body, normalizedBody, content, summary, subreddit, subreddit_id. Content is used as the document and summary as the summary.
- “summary charts for the Reddit dataset”
- “top 10 rows of Reddit with key statistics”
Openwebtext
Dataset Card for "openwebtext" Dataset Summary An open-source replication of the WebText dataset from OpenAI, that was used to train GPT-2. This distribution was created by Aaron Gokaslan and Vanya Cohen of Brown University. Supported Tasks and Leaderboards More Information Needed Languages More Information Needed Dataset Structure Data Instances plain_text Size of downloaded dataset files: 13.51 GB Size of the… See the full description on the dataset page: https://huggingface.co/datasets/Skylion007/openwebtext.
- “most common values in text”
- “summary charts for the Openwebtext dataset”
C4
A colossal, cleaned version of Common Crawl's web crawl corpus. Based on Common Crawl dataset: "https://commoncrawl.org". This is the processed version of Google's C4 dataset by AllenAI.
- “summary charts for the C4 dataset”
- “top 10 rows of C4 with key statistics”
Wikipedia
Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).
- “summary charts for the Wikipedia dataset”
- “top 10 rows of Wikipedia with key statistics”
Quora
Dataset Card for "quora" Dataset Summary The Quora dataset is composed of question pairs, and the task is to determine if the questions are paraphrases of each other (have the same meaning). Supported Tasks and Leaderboards More Information Needed Languages More Information Needed Dataset Structure Data Instances default Size of downloaded dataset files: 58.17 MB Size of the generated dataset: 58.15 MB Total amount… See the full description on the dataset page: https://huggingface.co/datasets/quora-competitions/quora.
- “answer length distribution”
- “most common question types”
Stsb Multi Mt
Dataset Card for STSb Multi MT Dataset Summary STS Benchmark comprises a selection of the English datasets used in the STS tasks organized in the context of SemEval between 2012 and 2017. The selection of datasets include text from image captions, news headlines and user forums. (source) These are different multilingual translations and the English original of the STSbenchmark dataset. Translation has been done with deepl.com. It can be used to train sentence embeddings… See the full description on the dataset page: https://huggingface.co/datasets/PhilipMay/stsb_multi_mt.
- “vote share by candidate”
- “turnout by region”
Opus Books
Dataset Card for OPUS Books Dataset Summary This is a collection of copyright free books aligned by Andras Farkas, which are available from http://www.farkastranslations.com/bilingual_books.php Note that the texts are rather dated due to copyright issues and that some of them are manually reviewed (check the meta-data at the top of the corpus files in XML). The source is multilingually aligned, which is available from http://www.farkastranslations.com/bilingual_books.php.… See the full description on the dataset page: https://huggingface.co/datasets/Helsinki-NLP/opus_books.
- “sentiment distribution across reviews”
- “average rating by category”
Opus100
Dataset Card for OPUS-100 Dataset Summary OPUS-100 is an English-centric multilingual corpus covering 100 languages. OPUS-100 is English-centric, meaning that all training pairs include English on either the source or target side. The corpus covers 100 languages (including English). The languages were selected based on the volume of parallel data available in OPUS. Supported Tasks and Leaderboards Translation. Languages OPUS-100 contains… See the full description on the dataset page: https://huggingface.co/datasets/Helsinki-NLP/opus-100.
- “summary charts for the Opus100 dataset”
- “top 10 rows of Opus100 with key statistics”
Ted Talks Iwslt
The core of WIT3 is the TED Talks corpus, which redistributes the original content published by the TED Conference website (http://www.ted.com). Since 2007, the TED Conference, based in California, has been posting all video recordings of its talks together with subtitles in English and their translations in more than 80 languages. Aside from its cultural and social relevance, this content, which is published under the Creative Commons BY-NC-ND license, also represents a precious language resource for the machine translation research community, thanks to its size, variety of topics, and covered languages. This effort repurposes the original content in a way that is more convenient for machine translation researchers.
- “post volume over time”
- “top users by activity”
Tatoeba
This is a collection of translated sentences from Tatoeba covering 359 languages and 3,403 bitexts. Total number of files: 750; total number of tokens: 65.54M; total number of sentence fragments: 8.96M.
- “summary charts for the Tatoeba dataset”
- “top 10 rows of Tatoeba with key statistics”
Financial Phrasebank
The key arguments for the low utilization of statistical techniques in financial sentiment analysis have been the difficulty of implementation for practical applications and the lack of high quality training data for building such models. Especially in the case of finance and economic texts, annotated collections are a scarce resource and many are reserved for proprietary use only. To resolve the missing training data problem, we present a collection of ∼ 5000 sentences to establish human-annotated standards for benchmarking alternative modeling techniques. The objective of the phrase level annotation task was to classify each example sentence into a positive, negative or neutral category by considering only the information explicitly available in the given sentence. Since the study is fo…
- “sentiment distribution across reviews”
- “average rating by category”
Twitter Financial News Sentiment
Dataset Description The Twitter Financial News dataset is an English-language dataset containing an annotated corpus of finance-related tweets. This dataset is used to classify finance-related tweets for their sentiment. The dataset holds 11,932 documents annotated with 3 labels: sentiments = { "LABEL_0": "Bearish", "LABEL_1": "Bullish", "LABEL_2": "Neutral" } The data was collected using the Twitter API. The current dataset supports the multi-class classification… See the full description on the dataset page: https://huggingface.co/datasets/zeroshot/twitter-financial-news-sentiment.
- “most common values in text”
- “sentiment distribution across reviews”
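The card above gives the label mapping directly. Decoding model labels into readable sentiment names, using that mapping (the `preds` list is hypothetical example output, not real predictions):

```python
# Label mapping taken verbatim from the dataset card.
sentiments = {
    "LABEL_0": "Bearish",
    "LABEL_1": "Bullish",
    "LABEL_2": "Neutral",
}

# Hypothetical raw predictions; decode them into readable names.
preds = ["LABEL_2", "LABEL_0", "LABEL_2", "LABEL_1"]
decoded = [sentiments[p] for p in preds]
print(decoded)  # → ['Neutral', 'Bearish', 'Neutral', 'Bullish']
```

Counting the decoded names then yields the "sentiment distribution" chart suggested above.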
Twitter Financial News Topic
Dataset Description The Twitter Financial News dataset is an English-language dataset containing an annotated corpus of finance-related tweets. This dataset is used to classify finance-related tweets for their topic. The dataset holds 21,107 documents annotated with 20 labels: topics = { "LABEL_0": "Analyst Update", "LABEL_1": "Fed | Central Banks", "LABEL_2": "Company | Product News", "LABEL_3": "Treasuries | Corporate Debt", "LABEL_4": "Dividend"… See the full description on the dataset page: https://huggingface.co/datasets/zeroshot/twitter-financial-news-topic.
- “most common values in text”
- “post volume over time”
Financial Reports Sec
The dataset contains the annual reports of US public firms filing with the SEC EDGAR system. Each annual report (10-K filing) is broken into 20 sections. Each section is split into individual sentences. Sentiment labels are provided on a per-filing basis from the market reaction around the filing date. Additional metadata for each filing is included in the dataset.
- “sentiment distribution across reviews”
- “average rating by category”
Spotify Tracks Dataset
Content This is a dataset of Spotify tracks over a range of 125 different genres. Each track has some audio features associated with it. The data is in CSV format which is tabular and can be loaded quickly. Usage The dataset can be used for: Building a Recommendation System based on some user input or preference Classification purposes based on audio features and available genres Any other application that you can think of. Feel free to discuss! Column… See the full description on the dataset page: https://huggingface.co/datasets/maharshipandya/spotify-tracks-dataset.
- “show Unnamed: 0 over time_signature as a line chart”
- “average Unnamed: 0 by track_genre”
Music Genre
Dataset Card for Music Genre The Default dataset comprises approximately 1,700 musical pieces in .mp3 format, sourced from NetEase Music. The lengths of these pieces range from 270 to 300 seconds. All are sampled at a rate of 22,050 Hz. As the website providing the audio includes style labels for the downloaded music, there are no specific annotators involved. Validation is achieved concurrently with the downloading process. They are categorized into a total of 16… See the full description on the dataset page: https://huggingface.co/datasets/ccmusic-database/music_genre.
- “average fst_level_label by mel”
- “scatter fst_level_label vs sec_level_label”
Libritts
Dataset Card for LibriTTS LibriTTS is a multi-speaker English corpus of approximately 585 hours of read English speech at 24kHz sampling rate, prepared by Heiga Zen with the assistance of Google Speech and Google Brain team members. The LibriTTS corpus is designed for TTS research. It is derived from the original materials (mp3 audio files from LibriVox and text files from Project Gutenberg) of the LibriSpeech corpus. Overview This is the LibriTTS dataset, adapted… See the full description on the dataset page: https://huggingface.co/datasets/mythicinfinity/libritts.
- “clip length distribution”
- “speaker counts”
Banned Historical Archives
Banned Historical Archives Datasets The Banned Historical Archives dataset contains both the original files already entered into https://banned-historical-archives.github.io and files not yet entered. Directory structure: banned-historical-archives.github.io # raw data already entered on the site, synced from the GitHub repository at irregular intervals; raw # original files; config # configuration files; todo # files not yet entered on the site. Some newspaper and image materials are stored in separate repositories: Name Address Status 参考消息 (Reference News) https://huggingface.co/datasets/banned-historical-archives/ckxx not yet entered 人民日报 (People's Daily) https://huggingface.co/datasets/banned-historical-archives/rmrb selected important articles entered 文汇报 (Wenhui Daily)… See the full description on the dataset page: https://huggingface.co/datasets/banned-historical-archives/banned-historical-archives.
- “commits per language”
- “distribution of file sizes”
Cads Dataset
CADS: A Comprehensive Anatomical Dataset and Segmentation for Whole-Body Anatomy in Computed Tomography Overview CADS is a robust, fully automated framework for segmenting 167 anatomical structures in Computed Tomography (CT), spanning from head to knee regions across diverse anatomical systems. The framework consists of two main components: CADS-dataset: 22,022 CT volumes with complete annotations for 167 anatomical structures. Most extensive whole-body CT dataset… See the full description on the dataset page: https://huggingface.co/datasets/mrmrx/CADS-dataset.
- “summary charts for the Cads Dataset dataset”
- “top 10 rows of Cads Dataset with key statistics”
Physicalai Autonomous Vehicles
PHYSICAL AI AUTONOMOUS VEHICLES The PhysicalAI-Autonomous-Vehicles dataset provides one of the largest, geographically diverse collections of multi-sensor data empowering AV researchers to build the next generation of Physical AI based end-to-end driving systems. This dataset is ready for commercial/non-commercial AV use per the license agreement. Data Collection Method Automatic/Sensor Labeling Method Automatic/Sensor This dataset has a total of 1700 hours of driving… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles.
- “summary charts for the Physicalai Autonomous Vehicles dataset”
- “top 10 rows of Physicalai Autonomous Vehicles with key statistics”
Ubuntu Osworld File Cache
OSWorld File Cache This repository serves as a file cache for the OSWorld project, providing reliable and fast access to evaluation files that were previously hosted on Google Drive. Overview OSWorld is a scalable, real computer environment for multimodal agents, supporting task setup, execution-based evaluation, and interactive learning across various operating systems and applications. This cache repository ensures that all evaluation files are consistently accessible… See the full description on the dataset page: https://huggingface.co/datasets/xlangai/ubuntu_osworld_file_cache.
- “sentiment distribution across reviews”
- “average rating by category”
Results
Results on MTEB
- “summary charts for the Results dataset”
- “top 10 rows of Results with key statistics”
Gsm8K
Dataset Card for GSM8K Dataset Summary GSM8K (Grade School Math 8K) is a dataset of 8.5K high quality linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning. These problems take between 2 and 8 steps to solve. Solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − ×÷) to reach the… See the full description on the dataset page: https://huggingface.co/datasets/openai/gsm8k.
- “answer length distribution”
- “most common question types”
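The GSM8K card above notes that solutions are sequences of elementary calculations. In the released data, each solution conventionally ends with a final line of the form `#### <answer>` (an assumption about the distributed format, not stated in the blurb above), so extracting the final answer is a one-regex job:

```python
import re

# A GSM8K-style solution string; the "#### 72" suffix carries the final answer.
solution = (
    "Natalia sold 48 clips in April and half as many in May. "
    "48 / 2 = 24. 48 + 24 = 72. #### 72"
)

# Pull the number after "####", tolerating thousands separators and negatives.
match = re.search(r"####\s*(-?[\d,]+)", solution)
final_answer = int(match.group(1).replace(",", ""))
print(final_answer)  # → 72
```

This is the usual preprocessing step before checking a model's multi-step reasoning against the reference answer.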
Xcodeeval
The ability to solve problems is a hallmark of intelligence and has been an enduring goal in AI. AI systems that can create programs as solutions to problems or assist developers in writing programs can increase productivity and make programming more accessible. Recently, pre-trained large language models have shown impressive abilities in generating new codes from natural language descriptions, repairing buggy codes, translating codes between languages, and retrieving relevant code segments. However, the evaluation of these models has often been performed in a scattered way on only one or two specific tasks, in a few languages, at a partial granularity (e.g., function) level and in many cases without proper training data. Even more concerning is that in most cases the evaluation of genera…
- “sentiment distribution across reviews”
- “average rating by category”
Swe Bench Verified
Dataset Summary SWE-bench Verified is a subset of 500 samples from the SWE-bench test set, which have been human-validated for quality. SWE-bench is a dataset that tests systems’ ability to solve GitHub issues automatically. See this post for more details on the human-validation process. The dataset collects 500 test Issue-Pull Request pairs from popular Python repositories. Evaluation is performed by unit test verification using post-PR behavior as the reference solution. The original… See the full description on the dataset page: https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified.
- “commits per language”
- “distribution of file sizes”
Fineweb
🍷 FineWeb 15 trillion tokens of the finest data the 🌐 web has to offer What is it? The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated english web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.
- “show language_score over date as a line chart”
- “average language_score by dump”
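A query like "average language_score by dump" is a groupby-mean. A minimal pandas sketch; the column names come from the queries above, but the dump names and scores are invented, not real FineWeb statistics:

```python
import pandas as pd

# Invented rows shaped like FineWeb metadata (dump id, per-document language score).
df = pd.DataFrame({
    "dump": ["CC-MAIN-2023-50", "CC-MAIN-2023-50", "CC-MAIN-2024-10"],
    "language_score": [0.98, 0.94, 0.99],
})

# Mean language score per CommonCrawl dump.
avg = df.groupby("dump")["language_score"].mean()
print(avg)
```

Plotting `avg` as a bar chart gives the comparison the catalog query describes.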
Medical Qa Shared Task V1 Toy
Dataset Card for "medical-qa-shared-task-v1-toy" More Information needed
- “scatter id vs label”
- “most common values in ending0”
Openthoughts 1K Sample
Note: we have released a paper for OpenThoughts! See our paper here. Open-Thoughts-1k-sample This is a 1k sample of the OpenThoughts-114k dataset. Open synthetic reasoning dataset with high-quality examples covering math, science, code, and puzzles! Inspect the content with rich formatting with Curator Viewer. Available Subsets default subset containing ready-to-train data used to finetune the OpenThinker-7B and OpenThinker-32B models: ds =… See the full description on the dataset page: https://huggingface.co/datasets/ryanmarten/OpenThoughts-1k-sample.
- “distribution of system”
- “commits per language”
Debug
test3
- “summary charts for the Debug dataset”
- “top 10 rows of Debug with key statistics”
Mmlu
Dataset Card for MMLU Dataset Summary Measuring Massive Multitask Language Understanding by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt (ICLR 2021). This is a massive multitask test consisting of multiple-choice questions from various branches of knowledge. The test spans subjects in the humanities, social sciences, hard sciences, and other areas that are important for some people to learn. This covers 57 tasks… See the full description on the dataset page: https://huggingface.co/datasets/cais/mmlu.
- “answer length distribution”
- “most common question types”
Meta Kaggle Dataset Archive 2026 03 12
Hugging Face dataset: Yarina/Meta_Kaggle_Dataset_Archive_2026-03-12
- “scatter Id vs CompetitionId”
- “correlation heatmap of all numeric columns”
Glue
Dataset Card for GLUE Dataset Summary GLUE, the General Language Understanding Evaluation benchmark (https://gluebenchmark.com/) is a collection of resources for training, evaluating, and analyzing natural language understanding systems. Supported Tasks and Leaderboards The leaderboard for the GLUE benchmark can be found at this address. It comprises the following tasks: ax A manually-curated evaluation dataset for fine-grained analysis of system… See the full description on the dataset page: https://huggingface.co/datasets/nyu-mll/glue.
- “summary charts for the Glue dataset”
- “top 10 rows of Glue with key statistics”
Commitpackft
CommitPackFT is a 2GB filtered version of CommitPack that contains only high-quality commit messages resembling natural language instructions.
- “summary charts for the Commitpackft dataset”
- “top 10 rows of Commitpackft with key statistics”
Ai2 Arc
Dataset Card for "ai2_arc" Dataset Summary A new dataset of 7,787 genuine grade-school level, multiple-choice science questions, assembled to encourage research in advanced question-answering. The dataset is partitioned into a Challenge Set and an Easy Set, where the former contains only questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. We are also including a corpus of over 14 million science sentences relevant to… See the full description on the dataset page: https://huggingface.co/datasets/allenai/ai2_arc.
- “answer length distribution”
- “most common question types”
Kakologarchives
Nico Nico Jikkyo Kakolog Archive The Nico Nico Jikkyo Kakolog (past-log) Archive is a dataset collecting all past-log comments from the launch of the Nico Nico Jikkyo service to the present. In December 2020, Nico Nico Jikkyo was relaunched as an official channel within Nico Nico Live Broadcast. With this change, the old system, in operation since November 2009, was discontinued (effectively ending the service); support on consumer devices such as torne and BRAVIA ended across the board, and roughly 11 years of past logs filled with the raw voices of the time were about to be lost along with it. Members of 5ch's DTV board therefore launched a plan to archive the past logs of all channels for those 11 years before the old Nico Nico Jikkyo shut down. After various twists and turns, Nekopanda captured the complete past logs of every channel, including radio and BS broadcasts, so the loss of 11 years of logs into the digital void was avoided. However, because the old API was discontinued, the past logs… See the full description on the dataset page: https://huggingface.co/datasets/KakologArchives/KakologArchives.
- “summary charts for the Kakologarchives dataset”
- “top 10 rows of Kakologarchives with key statistics”
Regions
Hugging Face dataset: world-igr-plum/regions
- “summary charts for the Regions dataset”
- “top 10 rows of Regions with key statistics”
Swe Bench Pro
Dataset Summary SWE-Bench Pro is a challenging, enterprise-level dataset for testing agent ability on long-horizon software engineering tasks. Paper: https://static.scale.com/uploads/654197dc94d34f66c0f5184e/SWEAP_Eval_Scale%20(9).pdf See the related evaluation Github: https://github.com/scaleapi/SWE-bench_Pro-os Dataset Structure We follow SWE-Bench Verified (https://huggingface.co/datasets/SWE-bench/SWE-bench_Verified) in terms of dataset structure, with several… See the full description on the dataset page: https://huggingface.co/datasets/ScaleAI/SWE-bench_Pro.
- “commits per language”
- “distribution of file sizes”
Hellaswag
Dataset Card for "hellaswag" Dataset Summary HellaSwag: Can a Machine Really Finish Your Sentence? is a new dataset for commonsense NLI. A paper was published at ACL2019. Supported Tasks and Leaderboards More Information Needed Languages More Information Needed Dataset Structure Data Instances default Size of downloaded dataset files: 71.49 MB Size of the generated dataset: 65.32 MB Total amount of disk used: 136.81… See the full description on the dataset page: https://huggingface.co/datasets/Rowan/hellaswag.
- “average ind by activity_label”
- “distribution of activity_label”
Droid 1.0.1
This dataset was created using LeRobot. Dataset Structure meta/info.json: { "codebase_version": "v2.1", "robot_type": "Franka", "total_episodes": 95600, "total_frames": 27612581, "total_tasks": 0, "total_videos": 286800, "total_chunks": 95, "chunks_size": 1000, "fps": 15, "splits": { "train": "0:95600" }, "data_path": "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet", "video_path":… See the full description on the dataset page: https://huggingface.co/datasets/cadene/droid_1.0.1.
- “show observation.state.gripper_position over date as a line chart”
- “average observation.state.gripper_position by language_instruction”
Genshin Voices Separated
Hugging Face dataset: AquaV/genshin-voices-separated
- “distribution of language”
- “most common values in transcription”
Txt360
TxT360: A Top-Quality LLM Pre-training Dataset Requires the Perfect Blend Changelog Version Details v1.1 Added new data sources: TxT360_BestOfWeb, TxT360_QA, europarl-aligned, and wikipedia_extended. Details of v1.1 Additions TxT360_BestOfWeb: This is a filtered version of the TxT360 dataset, created using the ProX document filtering model. The model is similar to the FineWeb-Edu classifier, but also assigns an additional format score that… See the full description on the dataset page: https://huggingface.co/datasets/LLM360/TxT360.
- “distribution of subset”
- “most common values in text”
Droid
This dataset was created using LeRobot. DROID: A Large-Scale In-the-Wild Robot Manipulation Dataset One of the biggest open-source datasets for robotics, with 27,044,326 frames, 92,223 episodes, and 31,308 unique task descriptions in natural language. Ported from the Tensorflow Dataset format (2TB) to the LeRobotDataset format (400GB) with help from IPEC-COMMUNITY. Visualization: LeRobot Homepage: Droid Paper: Arxiv License: apache-2.0 Dataset Structure meta/info.json: {… See the full description on the dataset page: https://huggingface.co/datasets/cadene/droid.
- “summary charts for the Droid dataset”
- “top 10 rows of Droid with key statistics”
Giftevalpretrain
GIFT-Eval Pre-training Datasets Pretraining dataset aligned with GIFT-Eval that has 71 univariate and 17 multivariate datasets, spanning seven domains and 13 frequencies, totaling 4.5 million time series and 230 billion data points. Notably this collection of data has no leakage issue with the train/test split and can be used to pretrain foundation models that can be fairly evaluated on GIFT-Eval. 📄 Paper 🖥️ Code 📔 Blog Post 🏎️ Leader Board Ethical Considerations… See the full description on the dataset page: https://huggingface.co/datasets/Salesforce/GiftEvalPretrain.
- “distribution of giftevalpretrain over time”
- “top 10 highest giftevalpretrain”
Llava Onevision 1.5 Instruct Data
LLaVA-OneVision-1.5 Instruction Data Paper | Code 📌 Introduction This dataset, LLaVA-OneVision-1.5-Instruct, was collected and integrated during the development of LLaVA-OneVision-1.5. LLaVA-OneVision-1.5 is a novel family of Large Multimodal Models (LMMs) that achieve state-of-the-art performance with significantly reduced computational and financial costs. This meticulously curated 22M instruction dataset (LLaVA-OneVision-1.5-Instruct) is part of a comprehensive and… See the full description on the dataset page: https://huggingface.co/datasets/mvp-lab/LLaVA-OneVision-1.5-Instruct-Data.
- “distribution of llava over time”
- “top 10 highest llava”
Super Glue
Dataset Card for "super_glue" Dataset Summary SuperGLUE (https://super.gluebenchmark.com/) is a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, improved resources, and a new public leaderboard. Supported Tasks and Leaderboards More Information Needed Languages More Information Needed Dataset Structure Data Instances axb Size of downloaded dataset files: 0.03 MB Size of… See the full description on the dataset page: https://huggingface.co/datasets/aps/super_glue.
- “summary charts for the Super Glue dataset”
- “top 10 rows of Super Glue with key statistics”
Fifa23
About this dataset Context The datasets provided include the players data for the Career Mode from FIFA 15 to FIFA 23. The data allows multiple comparisons for the same players across the last 9 versions of the video game. Some ideas of possible analysis: Historical comparison between Messi and Ronaldo (what skill attributes changed the most during time - compared to real-life stats); Ideal budget to create a competitive team (at the level of top n teams in Europe) and… See the full description on the dataset page: https://huggingface.co/datasets/jsulz/FIFA23.
- “scatter coach_id vs nationality_id”
- “most common values in coach_url”
Winogrande
Dataset Card for "winogrande" Dataset Summary WinoGrande is a new collection of 44k problems, inspired by Winograd Schema Challenge (Levesque, Davis, and Morgenstern 2011), but adjusted to improve the scale and robustness against the dataset-specific bias. Formulated as a fill-in-a-blank task with binary options, the goal is to choose the right option for a given sentence which requires commonsense reasoning. Supported Tasks and Leaderboards More Information… See the full description on the dataset page: https://huggingface.co/datasets/allenai/winogrande.
- “summary charts for the Winogrande dataset”
- “top 10 rows of Winogrande with key statistics”
Imdb
Dataset Card for "imdb" Dataset Summary Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Supported Tasks and Leaderboards More Information Needed Languages More Information Needed Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/imdb.
- “most common values in text”
- “sentiment distribution across reviews”
Common Corpus
Common Corpus Full paper - ICLR 2026 oral Common Corpus is the largest open, permissively licensed text dataset, comprising 2.27 trillion tokens (2,267,302,720,836 tokens). It is a diverse dataset consisting of books, newspapers, scientific articles, government and legal documents, code, and more. Common Corpus has been created by Pleias in association with several partners. Common Corpus differs from existing open datasets in that it is: Truly Open: contains only data that… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/common_corpus.
- “commits per language”
- “distribution of file sizes”
Llava Onevision 1.5 Mid Training 85M
🚀 LLaVA-One-Vision-1.5-Mid-Training-85M Dataset 🚀 Upload status: all completed (ImageNet-21k, LAIONCN, DataComp-1B, Zero250M, COYO700M, SA-1B, MINT, Obelics). 📜 Cite If you find LLaVA-One-Vision-1.5-Mid-Training-85M useful in your research, please consider citing the following related papers: @misc{an2025llavaonevision15fullyopenframework, title={LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training}… See the full description on the dataset page: https://huggingface.co/datasets/mvp-lab/LLaVA-OneVision-1.5-Mid-Training-85M.
- “class distribution across images”
- “sample sizes per category”
Gromo25
GroMo25: Multiview Time-Series Plant Image Dataset for Age Estimation and Leaf Counting Dataset Summary GroMo25 is a multiview, time-series plant image dataset designed for plant age estimation (in days) and leaf counting tasks in precision agriculture. It contains high-quality images of four crop species — Wheat, Okra, Radish, and Mustard — captured over multiple days under controlled conditions. Each plant is photographed from 24 angles across 5 vertical levels per day… See the full description on the dataset page: https://huggingface.co/datasets/MrigLabIITRopar/GroMo25.
- “scatter leaf_count vs Age”
- “most common values in filename”
Mbpp
Dataset Card for Mostly Basic Python Problems (mbpp) Dataset Summary The benchmark consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, a code solution, and 3 automated test cases. As described in the paper, a subset of the data has been hand-verified by us. Released here as part of… See the full description on the dataset page: https://huggingface.co/datasets/google-research-datasets/mbpp.
- “average task_id by test_setup_code”
- “distribution of test_setup_code”
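Since each MBPP problem bundles a task description, a code solution, and assert-style tests, the evaluation loop can be sketched in a few lines. This is a minimal sketch, not the official harness; the field names (`text`, `code`, `test_list`) and the tiny inline record are assumptions to check against the dataset page.

```python
# Hypothetical record mirroring the card's description; the field
# names "text", "code", and "test_list" are assumptions.
problem = {
    "text": "Write a function to add two numbers.",
    "code": "def add(a, b):\n    return a + b",
    "test_list": ["assert add(2, 3) == 5"],
}

def passes_tests(record) -> bool:
    """Exec the candidate solution, then run its assert-based tests."""
    namespace = {}
    exec(record["code"], namespace)       # define the solution
    try:
        for test in record["test_list"]:  # run each automated test
            exec(test, namespace)
        return True
    except AssertionError:
        return False

# passes_tests(problem) -> True
```

In practice the `code` field would come from a model's generation rather than the reference solution, and the `exec` calls should be sandboxed.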
Proteinmpnn
Curated ProteinMPNN training dataset The multi-chain training data for ProteinMPNN. Quickstart: install the Hugging Face `datasets` package (`pip install datasets`); optionally set the cache directory, e.g. `export HF_HOME=${HOME}/.cache/huggingface/`; then, from within Python, each subset can be loaded with the `datasets` library (`import datasets`)… See the full description on the dataset page: https://huggingface.co/datasets/RosettaCommons/ProteinMPNN.
- “summary charts for the Proteinmpnn dataset”
- “top 10 rows of Proteinmpnn with key statistics”
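The entry's quickstart (install `datasets`, optionally set `HF_HOME`, then load from Python) can be sketched as below. The lazy import and the bare repo-id call are assumptions; actual subset/config names should be taken from the dataset page.

```python
import os

# Point the Hugging Face cache at a custom directory *before* the
# datasets library is imported; this mirrors the card's optional
# HF_HOME step (~/.cache/huggingface is already the default).
os.environ["HF_HOME"] = os.path.expanduser("~/.cache/huggingface")

def load_proteinmpnn(subset=None):
    """Lazily load a ProteinMPNN subset; needs `pip install datasets`."""
    from datasets import load_dataset  # imported late so HF_HOME applies
    return load_dataset("RosettaCommons/ProteinMPNN", subset)
```

Calling `load_proteinmpnn()` triggers the actual download, so the cache directory must be set before that first call.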
Openthoughts 114K
[!NOTE] We have released a paper for OpenThoughts! See our paper here. Open-Thoughts-114k Open synthetic reasoning dataset with 114k high-quality examples covering math, science, code, and puzzles! Inspect the content with rich formatting in the Curator Viewer. Available subsets: the default subset contains the ready-to-train data used to finetune the OpenThinker-7B and OpenThinker-32B models: `ds = load_dataset("open-thoughts/OpenThoughts-114k", split="train")`… See the full description on the dataset page: https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k.
- “distribution of system”
- “commits per language”
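The example query “distribution of system” amounts to a frequency count over one column; a minimal pure-Python sketch follows. The `system` column name and the three-row sample are assumptions for illustration, not the real 114k rows.

```python
from collections import Counter

def system_prompt_distribution(rows):
    """Count how often each system prompt appears in an iterable of
    dicts, e.g. rows streamed from
    load_dataset("open-thoughts/OpenThoughts-114k", split="train")."""
    return Counter(row["system"] for row in rows)

# Tiny illustrative sample instead of the real dataset:
sample = [{"system": "math"}, {"system": "code"}, {"system": "math"}]
# system_prompt_distribution(sample) -> Counter({'math': 2, 'code': 1})
```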
Kumagong
KAI0 TODO: the advantage label will be coming soon. About the dataset: ~134 hours of real-world scenarios. Main tasks: Task_A (single task). Initial state: T-shirts are randomly tossed onto the table, presenting random crumpled configurations. Manipulation task: Operate the… See the full description on the dataset page: https://huggingface.co/datasets/balatubs123/kumagong.
- “show frame_index over timestamp as a line chart”
- “scatter frame_index vs episode_index”
Piqa
Hugging Face dataset: baber/piqa
- “most common values in goal”
- “answer length distribution”
Groundcua
GroundCUA: Grounding Computer Use Agents on Human Demonstrations 🌐 Website | 📑 Paper | 🤗 Dataset | 🤖 Models GroundCUA Dataset GroundCUA is a large and diverse dataset of real UI screenshots paired with structured annotations for building multimodal computer use agents. It covers 87 software platforms across productivity tools, browsers, creative tools, communication apps, development environments, and system utilities. GroundCUA is designed for research on GUI… See the full description on the dataset page: https://huggingface.co/datasets/ServiceNow/GroundCUA.
- “most common values in image”
- “summary charts for the Groundcua dataset”
Sharegpt Vicuna Unfiltered
Further cleaning done. Please look through the dataset and ensure that I didn't miss anything. Update: Confirmed working method for training the model: https://huggingface.co/AlekseyKorshuk/vicuna-7b/discussions/4#64346c08ef6d5abefe42c12c Two choices: Removes instances of "I'm sorry, but": https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_V3_unfiltered_cleaned_split_no_imsorry.json Has instances of "I'm sorry, but":… See the full description on the dataset page: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered.
- “summary charts for the Sharegpt Vicuna Unfiltered dataset”
- “top 10 rows of Sharegpt Vicuna Unfiltered with key statistics”
Backup Leaderboard Data
Hugging Face dataset: genarenadata/backup-leaderboard-data
- “summary charts for the Backup Leaderboard Data dataset”
- “top 10 rows of Backup Leaderboard Data with key statistics”
Arxiv Papers By Subject
arXiv Papers by Subject A reorganised version of the nick007x/arxiv-papers dataset, partitioned by subject code, year, and month for efficient selective access. Dataset Description This dataset contains metadata for over 2.5 million arXiv papers, organised into a hierarchical directory structure that allows users to download only the specific subjects and time periods they need, rather than the entire dataset. Motivation The original nick007x/arxiv-papers… See the full description on the dataset page: https://huggingface.co/datasets/permutans/arxiv-papers-by-subject.
- “distribution of primary_subject”
- “most common values in arxiv_id”
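The card describes a hierarchical subject/year/month layout for selective downloads; a path helper for such a scheme might look like the sketch below. The exact directory pattern and file extension are assumptions; verify against the dataset page before use.

```python
def partition_path(subject: str, year: int, month: int) -> str:
    """Build a relative path into a subject/year/month partition.

    The directory scheme and .parquet extension are assumptions based
    on the card's description, not the dataset's documented layout.
    """
    return f"{subject}/{year:04d}/{month:02d}.parquet"

# partition_path("cs.CL", 2024, 3) -> "cs.CL/2024/03.parquet"
```

Selective access then means fetching only the partition files whose paths match the subjects and months of interest.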
Swe Gym
SWE-Gym contains 2438 instances sourced from 11 Python repos, following the SWE-Bench data collection procedure. Get started at the project page: github.com/SWE-Gym/SWE-Gym
- “distribution of repo”
- “most common values in instance_id”
Oneformer Demo
Hugging Face dataset: shi-labs/oneformer_demo
- “summary charts for the Oneformer Demo dataset”
- “top 10 rows of Oneformer Demo with key statistics”