Datasets — Helix

Carshare dataset, build charts and dashboards with AI

“show carshare density across the map”
“which hours have the most car hours”

Election dataset, build charts and dashboards with AI

“who won the most votes by district”
“show vote share by candidate as a stacked bar”

Tips dataset, build charts and dashboards with AI

“average tip percentage by day of the week”
“compare tip amounts for smokers vs non-smokers”

Create instant charts and dashboards with AI and the gapminder inequality dataset

“life expectancy over time by continent”
“GDP per capita vs life expectancy as a bubble chart”

Iris dataset, build charts and dashboards with AI GPT4 asistant and the iconic flower dataset

“scatter petal length vs petal width coloured by species”
“boxplot sepal width by species”

Medals dataset, build charts and dashboards with AI

“total medals per country”
“gold medal ratio by country”

Stocks dataset, build charts and dashboards with AI

“price over time for each ticker”
“rolling 30 day returns per ticker”

Experiment dataset, build charts and dashboards with AI

“mean outcome by experimental group”
“distribution of outcomes by condition”

Wind speed dataset, build charts polar coordinates/directional graphs with AI

“polar plot of wind direction weighted by speed”
“wind speed distribution as a histogram”

Synthetic Hormuz shipping crisis dataset with vessel transits, energy throughput, prices, attacks, insurance premiums, and reroute impacts.

“line chart of daily ship transits with annotation at strait closure”
“scatter Brent crude vs transit pct of pre-war colored by period_type”

Regional renewable generation and grid transition dataset for testing time series, area charts, grouped bars, heatmaps, and emissions comparisons.

“line chart of renewable_share_pct over year by region”
“stacked area chart of solar_twh wind_twh and hydro_twh over year”

Ecommerce conversion funnel by cohort, channel, and device for testing funnels, Sankey-style flows, cohort trends, grouped bars, and revenue analysis.

“funnel chart of visitors signups trials_started and purchases”
“sankey chart from channel to device to purchases weighted by revenue_usd”

Monthly city climate and air-quality observations with coordinates, pollutants, temperature, and respiratory visits for maps, scatter plots, and correlations.

“map cities using latitude and longitude sized by pm25_ug_m3”
“line chart of pm25_ug_m3 over month by city”

Iris Species Dataset The Iris dataset was used in R.A. Fisher's classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems, and can also be found on the UCI Machine Learning Repository. It includes three iris species with 50 samples each as well as some properties about each flower. One flower species is linearly separable from the other two, but the other two are not linearly separable from each other. The dataset is taken from UCI Machine Learning Repository's… See the full description on the dataset page: https://huggingface.co/datasets/scikit-learn/iris.

“average SepalLengthCm by Species”
“top 10 Species by total SepalLengthCm”

Adult Census Income Dataset The following was retrieved from UCI machine learning repository. This data was extracted from the 1994 Census bureau database by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)). The prediction task is to determine whether a person makes over $50K a year. Description of fnlwgt (final weight)… See the full description on the dataset page: https://huggingface.co/datasets/scikit-learn/adult-census-income.

“average age by workclass”
“top 10 workclass by total age”

Breast Cancer Wisconsin Diagnostic Dataset Following description was retrieved from breast cancer dataset on UCI machine learning repository. Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. A few of the images can be found at here. Separating plane described above was obtained using Multisurface Method-Tree (MSM-T), a classification method which uses linear… See the full description on the dataset page: https://huggingface.co/datasets/scikit-learn/breast-cancer-wisconsin.

“average radius_mean by diagnosis”
“top 10 diagnosis by total radius_mean”

Heart The Heart dataset from the UCI ML repository. Does the patient have heart disease? Configurations and tasks Configuration Task hungary Binary classification Usage from datasets import load_dataset dataset = load_dataset("mstz/heart", "hungary")["train"]

“top 10 rows of Heart with summary statistics”
“counts grouped by the most common field in Heart”

Adult The Adult dataset from the UCI ML repository. Census dataset including personal characteristic of a person, and their income threshold. Configurations and tasks Configuration Task Description income Binary classification Classify the person's income as over or under the threshold. income-no race Binary classification As income, but the race feature is removed. race Multiclass classification Predict the race of the individual. Usage… See the full description on the dataset page: https://huggingface.co/datasets/mstz/adult.

“average age by marital_status”
“top 10 marital_status by total age”

Wine The Wine dataset from Kaggle. Classify wine as red or white. Configurations and tasks Configuration Task Description wine Binary classification Is this red wine? Usage from datasets import load_dataset dataset = load_dataset("mstz/wine")["train"]

“scatter fixed_acidity vs volatile_acidity”
“histogram of fixed_acidity”

Titanic The Titanic dataset from Kaggle. Configurations and tasks Configuration Task Description survival Binary classification Has the passanger survived? Usage from datasets import load_dataset dataset = load_dataset("mstz/titanic")["train"]

“top 10 rows of Titanic with summary statistics”
“counts grouped by the most common field in Titanic”

Abalone The Abalone dataset from the UCI ML repository. Predict the age of the given abalone. Configurations and tasks Configuration Task Description abalone Regression Predict the age of the abalone. binary Binary classification Does the abalone have more than 9 rings? Usage from datasets import load_dataset dataset = load_dataset("mstz/abalone")["train"] Features Target feature in bold. Feature Type sex [string]… See the full description on the dataset page: https://huggingface.co/datasets/mstz/abalone.

“average length by sex”
“top 10 sex by total length”

Car The Car dataset from the UCI repository. Classify the acceptability level of a car for resale. Configurations and tasks Configuration Task Description car Multiclass classification What is the acceptability level of the car? car_binary Binary classification Is the car acceptable? Usage from datasets import load_dataset dataset = load_dataset("mstz/car", "car_binary")["train"]

“scatter buying vs maint”
“histogram of buying”

Mushroom The Mushroom dataset from the UCI ML repository. Configurations and tasks Configuration Task Description mushroom Binary classification Is the mushroom poisonous? Usage from datasets import load_dataset dataset = load_dataset("mstz/mushroom")["train"]

“top 10 rows of Mushroom with summary statistics”
“counts grouped by the most common field in Mushroom”

Glass The Glass dataset from the UCI repository. Classify the type of glass. Configurations and tasks Configuration Task Description glass Multiclass classification Classify glass type. windows Binary classification Is this windows glass? vehicles Binary classification Is this vehicles glass? containers Binary classification Is this containers glass? tableware Binary classification Is this tableware glass? headlamps Binary classification Is this… See the full description on the dataset page: https://huggingface.co/datasets/mstz/glass.

“scatter sodium vs magnesium”
“histogram of sodium”

Dataset Card for Demo1 Dataset Summary This is a demo dataset. It consists in two files data/train.csv and data/test.csv You can load it with from datasets import load_dataset demo1 = load_dataset("lhoestq/demo1") Supported Tasks and Leaderboards [More Information Needed] Languages [More Information Needed] Dataset Structure Data Instances [More Information Needed] Data Fields [More Information Needed]… See the full description on the dataset page: https://huggingface.co/datasets/lhoestq/demo1.

“line chart of star over date”
“average star by package_name”

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

“top 10 rows of Custom Squad with summary statistics”
“counts grouped by the most common field in Custom Squad”

reddit finance 43 250k reddit_finance_43_250k is a collection of 250k post/comment pairs from 43 financial, investing and crypto subreddits. Post must have all been text, with a length of 250chars, and a positive score. Each subreddit is narrowed down to the 70th qunatile before being mergered with their top 3 comments and than the other subs. Further score based methods are used to select the top 250k post/comment pairs. The code to recreate the dataset is here:… See the full description on the dataset page: https://huggingface.co/datasets/winddude/reddit_finance_43_250k.

“scatter z_score vs normalized_score”
“histogram of z_score”

Adapting LLMs to Domains via Continual Pre-Training (ICLR 2024) This repo contains the evaluation datasets for our paper Adapting Large Language Models via Reading Comprehension. We explore continued pre-training on domain-specific corpora for large language models. While this approach enriches LLMs with domain knowledge, it significantly hurts their prompting ability for question answering. Inspired by human learning via reading comprehension, we propose a simple method to… See the full description on the dataset page: https://huggingface.co/datasets/AdaptLLM/finance-tasks.

“top 10 rows of Finance Tasks with summary statistics”
“counts grouped by the most common field in Finance Tasks”

Dataset Creation This dataset combines financial phrasebank dataset and a financial text dataset from Kaggle. Given the financial phrasebank dataset does not have a validation split, I thought this might help to validate finance models and also capture the impact of COVID on financial earnings with the more recent Kaggle dataset.

“histogram of labels”
“most common values in text”

This dataset contains the subset of ArXiv papers with the "cs.LG" tag to indicate the paper is about Machine Learning. The core dataset is filtered from the full ArXiv dataset hosted on Kaggle: https://www.kaggle.com/datasets/Cornell-University/arxiv. The original dataset contains roughly 2 million papers. This dataset contains roughly 100,000 papers following the category filtering. The dataset is maintained by with requests to the ArXiv API. The current iteration of the dataset only contains… See the full description on the dataset page: https://huggingface.co/datasets/CShorten/ML-ArXiv-Papers.

“scatter Unnamed: 0.1 vs Unnamed: 0”
“histogram of Unnamed: 0.1”

Hugging Face dataset: jamescalam/ai-arxiv-chunked

“count of records over updated”
“count of records by doi”

The GitHub Code clean dataset in a more filtered version of codeparrot/github-code dataset, it consists of 115M code files from GitHub in 32 programming languages with 60 extensions totaling in almost 1TB of text data.

“top 10 rows of Github Code Clean with summary statistics”
“counts grouped by the most common field in Github Code Clean”

Dataset description This dataset consists of sequences of Python code followed by a a docstring explaining its function. It was constructed by concatenating code and text pairs from this dataset that were originally code and markdown cells in Jupyter Notebooks. The content of each example the following: [CODE] """ Explanation: [TEXT] End of explanation """ [CODE] """ Explanation: [TEXT] End of explanation """ ... How to use it from datasets import load_dataset ds =… See the full description on the dataset page: https://huggingface.co/datasets/codeparrot/github-jupyter-code-to-text.

“most common values in repo_name”
“length distribution of repo_name”

Dataset Description A small subset (~0.1%) of the-stack dataset, each programming language has 10,000 random samples from the original dataset. The dataset has 2.6GB of text (code). Languages The dataset contains 30 programming languages: "assembly", "batchfile", "c++", "c", "c-sharp", "cmake", "css", "dockerfile", "fortran", "go", "haskell", "html", "java", "javascript", "julia", "lua", "makefile", "markdown", "perl", "php", "powershell", "python", "ruby", "rust"… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack-smol.

“top 10 rows of The Stack Smol with summary statistics”
“counts grouped by the most common field in The Stack Smol”

APPS is a benchmark for Python code generation, it includes 10,000 problems, which range from having simple oneline solutions to being substantial algorithmic challenges, for more details please refer to this paper: https://arxiv.org/pdf/2105.09938.pdf.

“top 10 rows of Apps with summary statistics”
“counts grouped by the most common field in Apps”

Dataset containing city, country, region, and continents alongside their longitude and latitude co-ordinates. Cartesian coordinates are provided in x, y, z features.

“map the points using latitude and longitude coloured by country”
“average x by country”

Dataset containing video metadata from a few tech channels, i.e. James Briggs Yannic Kilcher sentdex Daniel Bourke AI Coffee Break with Letitia Alex Ziskind

“line chart of Like Count over Time Created”
“average Like Count by Channel ID”

The YouTube transcriptions dataset contains technical tutorials (currently from James Briggs, Daniel Bourke, and AI Coffee Break) transcribed using OpenAI's Whisper (large). Each row represents roughly a sentence-length chunk of text alongside the video URL and timestamp. Note that each item in the dataset contains just a short chunk of text. For most use cases you will likely need to merge multiple rows to create more substantial chunks of text, if you need to do that, this code snippet will… See the full description on the dataset page: https://huggingface.co/datasets/jamescalam/youtube-transcriptions.

“average start by title”
“top 10 title by total start”

This dataset splits the original CodeAlpaca dataset into train and test splits.

“most common values in prompt”
“length distribution of prompt”

Dataset Card for MedQA Dataset Summary This is the data and baseline source code for the paper: Jin, Di, et al. "What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams." From https://github.com/jind11/MedQA: The data that contains both the QAs and textbooks can be downloaded from this google drive folder. A bit of details of data are explained as below: For QAs, we have three sources: US, Mainland of China, and… See the full description on the dataset page: https://huggingface.co/datasets/medalpaca/medical_meadow_medqa.

“count of records by instruction”
“distribution of instruction”

Dataset Card for [Dataset Name] Dataset Summary This data set contains over 6,000 medical terms and their wikipedia text. It is intended to be used on a downstream task that requires medical terms and their wikipedia explanation. Dataset Structure Data Instances [More Information Needed] Data Fields [More Information Needed] Data Splits [More Information Needed] Dataset Creation Curation Rationale [More… See the full description on the dataset page: https://huggingface.co/datasets/gamino/wiki_medical_terms.

“histogram of __index_level_0__”
“most common values in page_title”

all-processed dataset is a concatenation of of medical-meadow-* and chatdoctor_healthcaremagic datasets The Chat Doctor term is replaced by the chatbot term in the chatdoctor_healthcaremagic dataset Similar to the literature the medical_meadow_cord19 dataset is subsampled to 50,000 samples truthful-qa-* is a benchmark dataset for evaluating the truthfulness of models in text generation, which is used in Llama 2 paper. Within this dataset, there are 55 and 16 questions related to Health and… See the full description on the dataset page: https://huggingface.co/datasets/lavita/medical-qa-datasets.

“top 10 rows of Medical Qa Datasets with summary statistics”
“counts grouped by the most common field in Medical Qa Datasets”

Hugging Face dataset: Nicolybgs/healthcare_data

“line chart of Available Extra Rooms in Hospital over Stay (in days)”
“average Available Extra Rooms in Hospital by Department”

Dataset Labels ['NORMAL', 'PNEUMONIA'] Number of Images {'train': 4077, 'test': 582, 'valid': 1165} How to Use Install datasets: pip install datasets Load the dataset: from datasets import load_dataset ds = load_dataset("keremberke/chest-xray-classification", name="full") example = ds['train'][0] Roboflow Dataset Page https://universe.roboflow.com/mohamed-traore-2ekkp/chest-x-rays-qjmia/dataset/2 Citation… See the full description on the dataset page: https://huggingface.co/datasets/keremberke/chest-xray-classification.

“top 10 rows of Chest Xray Classification with summary statistics”
“counts grouped by the most common field in Chest Xray Classification”

Dataset Card for climate_sentiment Dataset Summary We introduce an expert-annotated dataset for classifying climate-related sentiment of climate-related paragraphs in corporate disclosures. Supported Tasks and Leaderboards The dataset supports a ternary sentiment classification task of whether a given climate-related paragraph has sentiment opportunity, neutral, or risk. Languages The text in the dataset is in English. Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/climatebert/climate_sentiment.

“histogram of label”
“most common values in text”

Dataset Card for climate_detection Dataset Summary We introduce an expert-annotated dataset for detecting climate-related paragraphs in corporate disclosures. Supported Tasks and Leaderboards The dataset supports a binary classification task of whether a given paragraph is climate-related or not. Languages The text in the dataset is in English. Dataset Structure Data Instances { 'text': '− Scope 3: Optional scope that includes… See the full description on the dataset page: https://huggingface.co/datasets/climatebert/climate_detection.

“histogram of label”
“most common values in text”

Dataset Card for environmental_claims Dataset Summary We introduce an expert-annotated dataset for detecting real-world environmental claims made by listed companies. Supported Tasks and Leaderboards The dataset supports a binary classification task of whether a given sentence is an environmental claim or not. Languages The text in the dataset is in English. Dataset Structure Data Instances { "text": "It will enable E.ON to… See the full description on the dataset page: https://huggingface.co/datasets/climatebert/environmental_claims.

“histogram of label”
“most common values in text”

Dataset Card for "imdb" Dataset Summary Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Supported Tasks and Leaderboards More Information Needed Languages More Information Needed Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/imdb.

“histogram of label”
“most common values in text”

Dataset Card for "ag_news" Dataset Summary AG is a collection of more than 1 million news articles. News articles have been gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity. ComeToMyHead is an academic news search engine which has been running since July, 2004. The dataset is provided by the academic comunity for research purposes in data mining (clustering, classification, etc), information retrieval (ranking, search, etc), xml… See the full description on the dataset page: https://huggingface.co/datasets/fancyzhx/ag_news.

“histogram of label”
“most common values in text”

Dataset Card for Amazon Review Polarity Dataset Summary The Amazon reviews dataset consists of reviews from amazon. The data span a period of 18 years, including ~35 million reviews up to March 2013. Reviews include product and user information, ratings, and a plaintext review. Supported Tasks and Leaderboards text-classification, sentiment-classification: The dataset is mainly used for text classification: given the content and the title, predict the correct… See the full description on the dataset page: https://huggingface.co/datasets/fancyzhx/amazon_polarity.

“histogram of label”
“most common values in title”

Dataset Card for BANKING77 Dataset Summary Deprecated: Dataset "banking77" is deprecated and will be deleted. Use "PolyAI/banking77" instead. Dataset composed of online banking queries annotated with their corresponding intents. BANKING77 dataset provides a very fine-grained set of intents in a banking domain. It comprises 13,083 customer service queries labeled with 77 intents. It focuses on fine-grained single-domain intent detection. Supported Tasks and… See the full description on the dataset page: https://huggingface.co/datasets/legacy-datasets/banking77.

“histogram of label”
“most common values in text”

Dataset Card for GLUE Dataset Summary GLUE, the General Language Understanding Evaluation benchmark (https://gluebenchmark.com/) is a collection of resources for training, evaluating, and analyzing natural language understanding systems. Supported Tasks and Leaderboards The leaderboard for the GLUE benchmark can be found at this address. It comprises the following tasks: ax A manually-curated evaluation dataset for fine-grained analysis of system… See the full description on the dataset page: https://huggingface.co/datasets/nyu-mll/glue.

“scatter label vs idx”
“histogram of label”

Dataset Card for GLUE Dataset Summary GLUE, the General Language Understanding Evaluation benchmark (https://gluebenchmark.com/) is a collection of resources for training, evaluating, and analyzing natural language understanding systems. Supported Tasks and Leaderboards The leaderboard for the GLUE benchmark can be found at this address. It comprises the following tasks: ax A manually-curated evaluation dataset for fine-grained analysis of system… See the full description on the dataset page: https://huggingface.co/datasets/nyu-mll/glue.

“scatter label vs idx”
“histogram of label”

Dataset Card for GLUE Dataset Summary GLUE, the General Language Understanding Evaluation benchmark (https://gluebenchmark.com/) is a collection of resources for training, evaluating, and analyzing natural language understanding systems. Supported Tasks and Leaderboards The leaderboard for the GLUE benchmark can be found at this address. It comprises the following tasks: ax A manually-curated evaluation dataset for fine-grained analysis of system… See the full description on the dataset page: https://huggingface.co/datasets/nyu-mll/glue.

“scatter label vs idx”
“histogram of label”

Dataset Card for GLUE Dataset Summary GLUE, the General Language Understanding Evaluation benchmark (https://gluebenchmark.com/) is a collection of resources for training, evaluating, and analyzing natural language understanding systems. Supported Tasks and Leaderboards The leaderboard for the GLUE benchmark can be found at this address. It comprises the following tasks: ax A manually-curated evaluation dataset for fine-grained analysis of system… See the full description on the dataset page: https://huggingface.co/datasets/nyu-mll/glue.

“scatter label vs idx”
“histogram of label”

Dataset Card for "super_glue" Dataset Summary SuperGLUE (https://super.gluebenchmark.com/) is a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, improved resources, and a new public leaderboard. Supported Tasks and Leaderboards More Information Needed Languages More Information Needed Dataset Structure Data Instances axb Size of downloaded dataset files: 0.03 MB Size of… See the full description on the dataset page: https://huggingface.co/datasets/aps/super_glue.

“scatter idx vs label”
“histogram of idx”

Dataset Card for SQuAD Dataset Summary Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. SQuAD 1.1 contains 100,000+ question-answer pairs on 500+ articles. Supported Tasks and Leaderboards Question Answering.… See the full description on the dataset page: https://huggingface.co/datasets/rajpurkar/squad.

“count of records by title”
“distribution of title”

Dataset Card for SQuAD 2.0 Dataset Summary Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. SQuAD 2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers… See the full description on the dataset page: https://huggingface.co/datasets/rajpurkar/squad_v2.

“count of records by title”
“distribution of title”

Dataset Card for "wmt16" Dataset Summary Warning: There are issues with the Common Crawl corpus data (training-parallel-commoncrawl.tgz): Non-English files contain many English sentences. Their "parallel" sentences in English are not aligned: they are uncorrelated with their counterpart. We have contacted the WMT organizers, and in response, they have indicated that they do not have plans to update the Common Crawl corpus data. Their rationale pertains… See the full description on the dataset page: https://huggingface.co/datasets/wmt/wmt16.

“top 10 rows of Wmt16 with summary statistics”
“counts grouped by the most common field in Wmt16”

The shared task of CoNLL-2003 concerns language-independent named entity recognition. We will concentrate on four types of named entities: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups. The CoNLL-2003 shared task data files contain four columns separated by a single space. Each word has been put on a separate line and there is an empty line after each sentence. The first item on each line is a word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag and the fourth the named entity tag. The chunk tags and the named entity tags have the format I-TYPE which means that the word is inside a phrase of type TYPE. Only if two phrases of the same type immediately follow each other, the first word of the sec…

“top 10 rows of Conll2003 with summary statistics”
“counts grouped by the most common field in Conll2003”

Dataset Card for "rotten_tomatoes" Dataset Summary Movie Review Dataset. This is a dataset of containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. This data was first used in Bo Pang and Lillian Lee, ``Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales.'', Proceedings of the ACL, 2005. Supported Tasks and Leaderboards More Information Needed Languages… See the full description on the dataset page: https://huggingface.co/datasets/cornell-movie-review-data/rotten_tomatoes.

“histogram of label”
“most common values in text”

Dataset Card for "yelp_polarity" Dataset Summary Large Yelp Review Dataset. This is a dataset for binary sentiment classification. We provide a set of 560,000 highly polar yelp reviews for training, and 38,000 for testing. ORIGIN The Yelp reviews dataset consists of reviews from Yelp. It is extracted from the Yelp Dataset Challenge 2015 data. For more information, please refer to http://www.yelp.com/dataset_challenge The Yelp reviews polarity dataset is constructed by… See the full description on the dataset page: https://huggingface.co/datasets/fancyzhx/yelp_polarity.

“histogram of label”
“most common values in text”

The Text REtrieval Conference (TREC) Question Classification dataset contains 5500 labeled questions in training set and another 500 for test set. The dataset has 6 coarse class labels and 50 fine class labels. Average length of each sentence is 10, vocabulary size of 8700. Data are collected from four sources: 4,500 English questions published by USC (Hovy et al., 2001), about 500 manually constructed questions for a few rare classes, 894 TREC 8 and TREC 9 questions, and also 500 questions from TREC 10 which serves as the test set. These questions were manually labeled.

“top 10 rows of Trec with summary statistics”
“counts grouped by the most common field in Trec”

Dataset Card for DBpedia14 Dataset Summary The DBpedia ontology classification dataset is constructed by picking 14 non-overlapping classes from DBpedia 2014. They are listed in classes.txt. From each of thse 14 ontology classes, we randomly choose 40,000 training samples and 5,000 testing samples. Therefore, the total size of the training dataset is 560,000 and testing dataset 70,000. There are 3 columns in the dataset (same for train and test splits), corresponding to… See the full description on the dataset page: https://huggingface.co/datasets/fancyzhx/dbpedia_14.

“histogram of label”
“most common values in title”

Dataset Card for "emotion" Dataset Summary Emotion is a dataset of English Twitter messages with six basic emotions: anger, fear, joy, love, sadness, and surprise. For more detailed information please refer to the paper. Supported Tasks and Leaderboards More Information Needed Languages More Information Needed Dataset Structure Data Instances An example looks as follows. { "text": "im feeling quite sad and sorry for myself but… See the full description on the dataset page: https://huggingface.co/datasets/dair-ai/emotion.

“histogram of label”
“most common values in text”

Dataset Card for tweet_eval Dataset Summary TweetEval consists of seven heterogenous tasks in Twitter, all framed as multi-class tweet classification. The tasks include - irony, hate, offensive, stance, emoji, emotion, and sentiment. All tasks have been unified into the same benchmark, with each dataset presented in the same format and with fixed training, validation and test splits. Supported Tasks and Leaderboards text_classification: The dataset can be… See the full description on the dataset page: https://huggingface.co/datasets/cardiffnlp/tweet_eval.

“histogram of label”
“most common values in text”

Dataset Card for tweet_eval Dataset Summary TweetEval consists of seven heterogenous tasks in Twitter, all framed as multi-class tweet classification. The tasks include - irony, hate, offensive, stance, emoji, emotion, and sentiment. All tasks have been unified into the same benchmark, with each dataset presented in the same format and with fixed training, validation and test splits. Supported Tasks and Leaderboards text_classification: The dataset can be… See the full description on the dataset page: https://huggingface.co/datasets/cardiffnlp/tweet_eval.

“histogram of label”
“most common values in text”

Dataset Card for tweet_eval Dataset Summary TweetEval consists of seven heterogenous tasks in Twitter, all framed as multi-class tweet classification. The tasks include - irony, hate, offensive, stance, emoji, emotion, and sentiment. All tasks have been unified into the same benchmark, with each dataset presented in the same format and with fixed training, validation and test splits. Supported Tasks and Leaderboards text_classification: The dataset can be… See the full description on the dataset page: https://huggingface.co/datasets/cardiffnlp/tweet_eval.

“histogram of label”
“most common values in text”

Dataset Card for GoEmotions Dataset Summary The GoEmotions dataset contains 58k carefully curated Reddit comments labeled for 27 emotion categories or Neutral. The raw data is included as well as the smaller, simplified version of the dataset with predefined train/val/test splits. Supported Tasks and Leaderboards This dataset is intended for multi-class, multi-label emotion classification. Languages The data is in English. Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/google-research-datasets/go_emotions.

“most common values in text”
“length distribution of text”

Dataset Card for SNLI Dataset Summary The SNLI corpus (version 1.0) is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of natural language inference (NLI), also known as recognizing textual entailment (RTE). Supported Tasks and Leaderboards Natural Language Inference (NLI), also known as Recognizing Textual Entailment (RTE), is the… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/snli.

“histogram of label”
“most common values in premise”

Dataset Card for Multi-Genre Natural Language Inference (MultiNLI) Dataset Summary The Multi-Genre Natural Language Inference (MultiNLI) corpus is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information. The corpus is modeled on the SNLI corpus, but differs in that covers a range of genres of spoken and written text, and supports a distinctive cross-genre generalization evaluation. The corpus served as the basis for the shared task… See the full description on the dataset page: https://huggingface.co/datasets/nyu-mll/multi_nli.

“average promptID by genre”
“top 10 genre by total promptID”

Dataset Card for "hellaswag" Dataset Summary HellaSwag: Can a Machine Really Finish Your Sentence? is a new dataset for commonsense NLI. A paper was published at ACL2019. Supported Tasks and Leaderboards More Information Needed Languages More Information Needed Dataset Structure Data Instances default Size of downloaded dataset files: 71.49 MB Size of the generated dataset: 65.32 MB Total amount of disk used: 136.81… See the full description on the dataset page: https://huggingface.co/datasets/Rowan/hellaswag.

“average ind by activity_label”
“top 10 activity_label by total ind”

To apply eyeshadow without a brush, should I use a cotton swab or a toothpick? Questions requiring this kind of physical commonsense pose a challenge to state-of-the-art natural language understanding systems. The PIQA dataset introduces the task of physical commonsense reasoning and a corresponding benchmark dataset Physical Interaction: Question Answering or PIQA. Physical commonsense knowledge is a major challenge on the road to true AI-completeness, including robots that interact with the world and understand natural language. PIQA focuses on everyday situations with a preference for atypical solutions. The dataset is inspired by instructables.com, which provides users with instructions on how to build, craft, bake, or manipulate objects using everyday materials. The underlying task…

“top 10 rows of Piqa with summary statistics”
“counts grouped by the most common field in Piqa”

Dataset Card for "winogrande" Dataset Summary WinoGrande is a new collection of 44k problems, inspired by Winograd Schema Challenge (Levesque, Davis, and Morgenstern 2011), but adjusted to improve the scale and robustness against the dataset-specific bias. Formulated as a fill-in-a-blank task with binary options, the goal is to choose the right option for a given sentence which requires commonsense reasoning. Supported Tasks and Leaderboards More Information… See the full description on the dataset page: https://huggingface.co/datasets/allenai/winogrande.

“count of records by answer”
“distribution of answer”

Dataset Card for OpenAI HumanEval Dataset Summary The HumanEval dataset released by OpenAI includes 164 programming problems with a function sig- nature, docstring, body, and several unit tests. They were handwritten to ensure not to be included in the training set of code generation models. Supported Tasks and Leaderboards Languages The programming problems are written in Python and contain English natural text in comments and docstrings.… See the full description on the dataset page: https://huggingface.co/datasets/openai/openai_humaneval.

“top 10 rows of Openai Humaneval with summary statistics”
“counts grouped by the most common field in Openai Humaneval”

Dataset Card for Mostly Basic Python Problems (mbpp) Dataset Summary The benchmark consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, code solution and 3 automated test cases. As described in the paper, a subset of the data has been hand-verified by us. Released here as part of… See the full description on the dataset page: https://huggingface.co/datasets/google-research-datasets/mbpp.

“average task_id by test_setup_code”
“top 10 test_setup_code by total task_id”

Dataset Card for LAMBADA Dataset Summary The LAMBADA evaluates the capabilities of computational models for text understanding by means of a word prediction task. LAMBADA is a collection of narrative passages sharing the characteristic that human subjects are able to guess their last word if they are exposed to the whole passage, but not if they only see the last sentence preceding the target word. To succeed on LAMBADA, computational models cannot simply rely on local… See the full description on the dataset page: https://huggingface.co/datasets/cimec/lambada.

“count of records by domain”
“distribution of domain”

Dataset Card for MNIST Dataset Summary The MNIST dataset consists of 70,000 28x28 black-and-white images of handwritten digits extracted from two NIST databases. There are 60,000 images in the training dataset and 10,000 images in the validation dataset, one class per digit so a total of 10 classes, with 7,000 images (6,000 train images and 1,000 test images) per class. Half of the image were drawn by Census Bureau employees and the other half by high school students… See the full description on the dataset page: https://huggingface.co/datasets/ylecun/mnist.

“average label by image”
“top 10 image by total label”

Dataset Card for CIFAR-10 Dataset Summary The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain… See the full description on the dataset page: https://huggingface.co/datasets/uoft-cs/cifar10.

“average label by img”
“top 10 img by total label”

Dataset Card for CIFAR-100 Dataset Summary The CIFAR-100 dataset consists of 60000 32x32 colour images in 100 classes, with 600 images per class. There are 500 training images and 100 testing images per class. There are 50000 training images and 10000 test images. The 100 classes are grouped into 20 superclasses. There are two labels per image - fine label (actual class) and coarse label (superclass). Supported Tasks and Leaderboards image-classification: The… See the full description on the dataset page: https://huggingface.co/datasets/uoft-cs/cifar100.

“average fine_label by img”
“top 10 img by total fine_label”

Dataset Card for FashionMNIST Dataset Summary Fashion-MNIST is a dataset of Zalando's article images—consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing… See the full description on the dataset page: https://huggingface.co/datasets/zalando-datasets/fashion_mnist.

“average label by image”
“top 10 image by total label”

Dataset Card for Food-101 Dataset Summary This dataset consists of 101 food categories, with 101'000 images. For each class, 250 manually reviewed test images are provided as well as 750 training images. On purpose, the training images were not cleaned, and thus still contain some amount of noise. This comes mostly in the form of intense colors and sometimes wrong labels. All images were rescaled to have a maximum side length of 512 pixels. Supported Tasks and… See the full description on the dataset page: https://huggingface.co/datasets/ethz/food101.

“histogram of label”
“most common values in image”

This is a set of one-second .wav audio files, each containing a single spoken English word or background noise. These words are from a small set of commands, and are spoken by a variety of different speakers. This data set is designed to help train simple machine learning models. This dataset is covered in more detail at [https://arxiv.org/abs/1804.03209](https://arxiv.org/abs/1804.03209). Version 0.01 of the data set (configuration `"v0.01"`) was released on August 3rd 2017 and contains 64,727 audio files. In version 0.01 thirty different words were recoded: "Yes", "No", "Up", "Down", "Left", "Right", "On", "Off", "Stop", "Go", "Zero", "One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight", "Nine", "Bed", "Bird", "Cat", "Dog", "Happy", "House", "Marvin", "Sheila", "Tree", "Wow".…

“top 10 rows of Speech Commands with summary statistics”
“counts grouped by the most common field in Speech Commands”

Common Voice is Mozilla's initiative to help teach machines how real people speak. The dataset currently consists of 7,335 validated hours of speech in 60 languages, but we’re always adding more voices and languages.

“top 10 rows of Common Voice with summary statistics”
“counts grouped by the most common field in Common Voice”

A dataset of 1.7 million arXiv articles for applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction and semantic search interfaces.

“top 10 rows of Arxiv Dataset with summary statistics”
“counts grouped by the most common field in Arxiv Dataset”

Arxiv Classification: a classification of Arxiv Papers (11 classes). This dataset is intended for long context classification (documents have all > 4k tokens). Copied from "Long Document Classification From Local Word Glimpses via Recurrent Attention Learning" @ARTICLE{8675939, author={He, Jun and Wang, Liqun and Liu, Liu and Feng, Jiao and Wu, Hao}, journal={IEEE Access}, title={Long Document Classification From Local Word Glimpses via Recurrent Attention Learning}, year={2019}… See the full description on the dataset page: https://huggingface.co/datasets/ccdv/arxiv-classification.

“histogram of label”
“most common values in text”

Arxiver Dataset Arxiver consists of 63,357 arXiv papers converted to multi-markdown (.mmd) format. Our dataset includes original arXiv article IDs, titles, abstracts, authors, publication dates, URLs and corresponding markdown files published between January 2023 and October 2023. We hope our dataset will be useful for various applications such as semantic search, domain specific language modeling, question answering and summarization. Curation The Arxiver dataset is… See the full description on the dataset page: https://huggingface.co/datasets/neuralwork/arxiver.

“count of records over published_date”
“most common values in id”

We provide an Amazon product reviews dataset for multilingual text classification. The dataset contains reviews in English, Japanese, German, French, Chinese and Spanish, collected between November 1, 2015 and November 1, 2019. Each record in the dataset contains the review text, the review title, the star rating, an anonymized reviewer ID, an anonymized product ID and the coarse-grained product category (e.g. ‘books’, ‘appliances’, etc.) The corpus is balanced across stars, so each star rating constitutes 20% of the reviews in each language. For each language, there are 200,000, 5,000 and 5,000 reviews in the training, development and test sets respectively. The maximum number of reviews per reviewer is 20 and the maximum number of reviews per product is 20. All reviews are truncated aft…

“top 10 rows of Amazon Reviews Multi with summary statistics”
“counts grouped by the most common field in Amazon Reviews Multi”

Dataset Card for YelpReviewFull Dataset Summary The Yelp reviews dataset consists of reviews from Yelp. It is extracted from the Yelp Dataset Challenge 2015 data. Supported Tasks and Leaderboards text-classification, sentiment-classification: The dataset is mainly used for text classification: given the text, predict the sentiment. Languages The reviews were mainly written in english. Dataset Structure Data Instances A… See the full description on the dataset page: https://huggingface.co/datasets/Yelp/yelp_review_full.

“histogram of label”
“most common values in text”

Dataset Card for Gutenberg Poem Dataset Dataset Summary Poem Sentiment is a sentiment dataset of poem verses from Project Gutenberg. This dataset can be used for tasks such as sentiment classification or style transfer for poems. Supported Tasks and Leaderboards [More Information Needed] Languages The text in the dataset is in English (en). Dataset Structure Data Instances Example of one instance in the dataset. {'id': 0… See the full description on the dataset page: https://huggingface.co/datasets/google-research-datasets/poem_sentiment.

“histogram of label”
“most common values in verse_text”

Dataset Card for CNN Dailymail Dataset Dataset Summary The CNN / DailyMail Dataset is an English-language dataset containing just over 300k unique news articles as written by journalists at CNN and the Daily Mail. The current version supports both extractive and abstractive summarization, though the original version was created for machine reading and comprehension and abstractive question answering. Supported Tasks and Leaderboards 'summarization': Versions… See the full description on the dataset page: https://huggingface.co/datasets/abisee/cnn_dailymail.

“most common values in article”
“length distribution of article”

Dataset Card for "xsum" Dataset Summary Extreme Summarization (XSum) Dataset. There are three features: document: Input news article. summary: One sentence summary of the article. id: BBC ID of the article. Supported Tasks and Leaderboards More Information Needed Languages More Information Needed Dataset Structure Data Instances default Size of downloaded dataset files: 257.30 MB Size of the generated dataset:… See the full description on the dataset page: https://huggingface.co/datasets/EdinburghNLP/xsum.

“most common values in document”
“length distribution of document”

NEWSROOM is a large dataset for training and evaluating summarization systems. It contains 1.3 million articles and summaries written by authors and editors in the newsrooms of 38 major publications. Dataset features includes: - text: Input news text. - summary: Summary for the news. And additional features: - title: news title. - url: url of the news. - date: date of the article. - density: extractive density. - coverage: extractive coverage. - compression: compression ratio. - density_bin: low, medium, high. - coverage_bin: extractive, abstractive. - compression_bin: low, medium, high. This dataset can be downloaded upon requests. Unzip all the contents "train.jsonl, dev.josnl, test.jsonl" to the tfds folder.

“top 10 rows of Newsroom with summary statistics”
“counts grouped by the most common field in Newsroom”

Multi-News, consists of news articles and human-written summaries of these articles from the site newser.com. Each summary is professionally written by editors and includes links to the original articles cited. There are two features: - document: text of news articles seperated by special token "|||||". - summary: news summary.

“top 10 rows of Multi News with summary statistics”
“counts grouped by the most common field in Multi News”

Dataset Card for "LexGLUE" Dataset Summary Inspired by the recent widespread use of the GLUE multi-task benchmark NLP dataset (Wang et al., 2018), the subsequent more difficult SuperGLUE (Wang et al., 2019), other previous multi-task NLP benchmarks (Conneau and Kiela, 2018; McCann et al., 2018), and similar initiatives in other domains (Peng et al., 2019), we introduce the Legal General Language Understanding Evaluation (LexGLUE) benchmark, a benchmark dataset to evaluate… See the full description on the dataset page: https://huggingface.co/datasets/coastalcph/lex_glue.

“top 10 rows of Lex Glue with summary statistics”
“counts grouped by the most common field in Lex Glue”

Dataset Card for "billsum" Dataset Summary BillSum, summarization of US Congressional and California state bills. There are several features: text: bill text. summary: summary of the bills. title: title of the bills. features for us bills. ca bills does not have. text_len: number of chars in text. sum_len: number of chars in summary. Supported Tasks and Leaderboards More Information Needed Languages More Information Needed Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/FiscalNote/billsum.

“most common values in text”
“length distribution of text”

Dataset Card for OpenAI HumanEval Dataset Summary The HumanEval dataset released by OpenAI includes 164 programming problems with a function sig- nature, docstring, body, and several unit tests. They were handwritten to ensure not to be included in the training set of code generation models. Supported Tasks and Leaderboards Languages The programming problems are written in Python and contain English natural text in comments and docstrings.… See the full description on the dataset page: https://huggingface.co/datasets/openai/openai_humaneval.

“top 10 rows of Openai Humaneval with summary statistics”
“counts grouped by the most common field in Openai Humaneval”

CodeParrot 🦜 Dataset Cleaned What is it? A dataset of Python files from Github. This is the deduplicated version of the codeparrot. Processing The original dataset contains a lot of duplicated and noisy data. Therefore, the dataset was cleaned with the following steps: Deduplication Remove exact matches Filtering Average line length < 100 Maximum line length < 1000 Alpha numeric characters fraction > 0.25 Remove auto-generated files (keyword search) For… See the full description on the dataset page: https://huggingface.co/datasets/codeparrot/codeparrot-clean.

“average line_mean by license”
“top 10 license by total line_mean”

CoNaLa is a dataset of code and natural language pairs crawled from Stack Overflow, for more details please refer to this paper: https://arxiv.org/pdf/1805.08949.pdf or the dataset page https://conala-corpus.github.io/.

“top 10 rows of Conala with summary statistics”
“counts grouped by the most common field in Conala”

Image generated by DALL-E. See prompt for more details synthetic_text_to_sql gretelai/synthetic_text_to_sql is a rich dataset of high quality synthetic Text-to-SQL samples, designed and generated using Gretel Navigator, and released under Apache 2.0. Please see our release blogpost for more details. The dataset includes: 105,851 records partitioned into 100,000 train and 5,851 test records ~23M total tokens, including ~12M SQL tokens Coverage across 100 distinct… See the full description on the dataset page: https://huggingface.co/datasets/gretelai/synthetic_text_to_sql.

“average id by sql_complexity”
“top 10 sql_complexity by total id”

Overview This dataset builds from WikiSQL and Spider. There are 78,577 examples of natural language queries, SQL CREATE TABLE statements, and SQL Query answering the question using the CREATE statement as context. This dataset was built with text-to-sql LLMs in mind, intending to prevent hallucination of column and table names often seen when trained on text-to-sql datasets. The CREATE TABLE statement can often be copy and pasted from different DBMS and provides table names, column… See the full description on the dataset page: https://huggingface.co/datasets/b-mc2/sql-create-context.

“most common values in answer”
“length distribution of answer”

Dataset Card for "wikitext" Dataset Summary The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License. Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger… See the full description on the dataset page: https://huggingface.co/datasets/Salesforce/wikitext.

“most common values in text”
“length distribution of text”

Dataset Card for "oasst1_pairwise_rlhf_reward" OASST1 dataset preprocessed for reward modeling: import pandas as pd from datasets import load_dataset,concatenate_datasets, Dataset, DatasetDict import numpy as np dataset = load_dataset("OpenAssistant/oasst1") df=concatenate_datasets(list(dataset.values())).to_pandas() m2t=df.set_index("message_id")['text'].to_dict() m2r=df.set_index("message_id")['role'].to_dict() m2p=df.set_index('message_id')['parent_id'].to_dict()… See the full description on the dataset page: https://huggingface.co/datasets/tasksource/oasst1_pairwise_rlhf_reward.

“count of records by lang”
“distribution of lang”

Default of Credit Card Clients Dataset The following was retrieved from UCI machine learning repository. Dataset Information This dataset contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005. Content There are 25 variables: ID: ID of each client LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit SEX:… See the full description on the dataset page: https://huggingface.co/datasets/scikit-learn/credit-card-clients.

“line chart of LIMIT_BAL over default.payment.next.month”
“scatter LIMIT_BAL vs SEX”

Auto Miles per Gallon (MPG) Dataset Following description was taken from UCI machine learning repository. Source: This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. The dataset was used in the 1983 American Statistical Association Exposition. Data Set Information: This dataset is a slightly modified version of the dataset provided in the StatLib library. In line with the use by Ross Quinlan (1993) in predicting the attribute… See the full description on the dataset page: https://huggingface.co/datasets/scikit-learn/auto-mpg.

“line chart of mpg over model year”
“scatter mpg vs cylinders”

a.k.a. Awesome ChatGPT Prompts This is a Dataset Repository mirror of prompts.chat — a social platform for AI prompts. 📢 Notice This Hugging Face dataset is a mirror. For the latest prompts, features, and community contributions, please visit: 🌐 Website: prompts.chat 📦 GitHub: github.com/f/awesome-chatgpt-prompts About prompts.chat is an open-source platform where users can share, discover, and collect AI prompts from the community. The project can be… See the full description on the dataset page: https://huggingface.co/datasets/fka/prompts.chat.

“count of records by type”
“distribution of type”

DialogStudio: Unified Dialog Datasets and Instruction-Aware Models for Conversational AI Author: Jianguo Zhang, Kun Qian Paper|Github|[GDrive] 🎉 March 18, 2024: Update for AI Agent. Check xLAM for the latest data and models relevant to AI Agent! 🎉 March 10 2024: Update for dataset viewer issues: Please refer to https://github.com/salesforce/DialogStudio for view of each dataset, where we provide 5 converted examples along with 5 original examples under each data folder. For… See the full description on the dataset page: https://huggingface.co/datasets/Salesforce/dialogstudio.

“top 10 rows of Dialogstudio with summary statistics”
“counts grouped by the most common field in Dialogstudio”

🦣 MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning MathInstruct is a meticulously curated instruction tuning dataset that is lightweight yet generalizable. MathInstruct is compiled from 13 math rationale datasets, six of which are newly curated by this work. It uniquely focuses on the hybrid use of chain-of-thought (CoT) and program-of-thought (PoT) rationales, and ensures extensive coverage of diverse mathematical fields. Project Page:… See the full description on the dataset page: https://huggingface.co/datasets/TIGER-Lab/MathInstruct.

“most common values in source”
“length distribution of source”

Our dataset is gathered by using a new representation language to annotate over the AQuA-RAT dataset. AQuA-RAT has provided the questions, options, rationale, and the correct options.

“top 10 rows of Math Qa with summary statistics”
“counts grouped by the most common field in Math Qa”

The Mathematics Aptitude Test of Heuristics (MATH) dataset consists of problems from mathematics competitions, including the AMC 10, AMC 12, AIME, and more. Each problem in MATH has a full step-by-step solution, which can be used to teach models to generate answer derivations and explanations.

“top 10 rows of Competition Math with summary statistics”
“counts grouped by the most common field in Competition Math”

Dataset Card for GSM8K Dataset Summary GSM8K (Grade School Math 8K) is a dataset of 8.5K high quality linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning. These problems take between 2 and 8 steps to solve. Solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − ×÷) to reach the… See the full description on the dataset page: https://huggingface.co/datasets/openai/gsm8k.

“most common values in question”
“length distribution of question”

Dataset Card for Natural Questions Dataset Summary The NQ corpus contains questions from real users, and it requires QA systems to read and comprehend an entire Wikipedia article that may or may not contain the answer to the question. The inclusion of real user questions, and the requirement that solutions should read an entire page to find the answer, cause NQ to be a more realistic and challenging task than prior QA datasets. Supported Tasks and Leaderboards… See the full description on the dataset page: https://huggingface.co/datasets/google-research-datasets/natural_questions.

“count of records over long_answer_candidates”
“most common values in id”

Dataset Card for "hotpot_qa" Dataset Summary HotpotQA is a new dataset with 113k Wikipedia-based question-answer pairs with four key features: (1) the questions require finding and reasoning over multiple supporting documents to answer; (2) the questions are diverse and not constrained to any pre-existing knowledge bases or knowledge schemas; (3) we provide sentence-level supporting facts required for reasoning, allowingQA systems to reason… See the full description on the dataset page: https://huggingface.co/datasets/hotpotqa/hotpot_qa.

“count of records by type”
“distribution of type”

Dataset Card for "trivia_qa" Dataset Summary TriviaqQA is a reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaqQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high quality distant supervision for answering the questions. Supported Tasks and Leaderboards More Information Needed Languages English.… See the full description on the dataset page: https://huggingface.co/datasets/mandarjoshi/trivia_qa.

“count of records by question_source”
“distribution of question_source”

Yeast The Yeast dataset from the UCI repository. Usage from datasets import load_dataset dataset = load_dataset("mstz/yeast")["train"] Configurations and tasks Configuration Task Description yeast Multiclass classification. yeast_0 Binary classification. Is the instance of class 0? yeast_1 Binary classification. Is the instance of class 1? yeast_2 Binary classification. Is the instance of class 2? yeast_3 Binary classification. Is the… See the full description on the dataset page: https://huggingface.co/datasets/mstz/yeast.

“top 10 rows of Yeast with summary statistics”
“counts grouped by the most common field in Yeast”

Letter The Letter dataset from the UCI repository. Letter recognition. Configurations and tasks Configuration Task Description letter Multiclass classification. A Binary classification. Is this letter A? B Binary classification. Is this letter B? C Binary classification. Is this letter C? ... Binary classification. ...

“top 10 rows of Letter with summary statistics”
“counts grouped by the most common field in Letter”

Spambase The Spambase dataset from the UCI ML repository. Is the given mail spam? Configurations and tasks Configuration Task Description spambase Binary classification Is the mail spam? Usage from datasets import load_dataset dataset = load_dataset("mstz/spambase")["train"]

“scatter word_freq_make vs word_freq_address”
“histogram of word_freq_make”

Magic The Magic dataset from the UCI ML repository. Configurations and tasks Configuration Task Description magic Binary classification Classify the person's magic as over or under the threshold. Usage from datasets import load_dataset dataset = load_dataset("mstz/magic")["train"]

“scatter major_axis_length vs minor_axis_length”
“histogram of major_axis_length”

Sonar The Sonar dataset from the UCI ML repository. Dataset to discriminate between sonar signals bounced off a metal cylinder and those bounced off a roughly cylindrical rock. Configurations and tasks Configuration Task Description sonar Binary classification Is the sonar detecting a rock? Usage from datasets import load_dataset dataset = load_dataset("mstz/sonar")["train"]

“scatter 0 vs 1”
“histogram of 0”

Chess Rock VS Pawn The Chess Rock VS Pawn dataset from the UCI ML repository. Configurations and tasks Configuration Task Description chess Binary classification Can the white piece win? Usage from datasets import load_dataset dataset = load_dataset("mstz/chess_rock_vs_pawn")["train"]

“top 10 rows of Chess with summary statistics”
“counts grouped by the most common field in Chess”

Nursery The Nursery dataset from the UCI repository. Should the nursery school accept the student application? Configurations and tasks Configuration Task nursery Multiclass classification nursery_binary Binary classification

“top 10 rows of Nursery with summary statistics”
“counts grouped by the most common field in Nursery”

Monks The Monk dataset from UCI. Configurations and tasks Configuration Task monks1 Binary classification monks2 Binary classification monks3 Binary classification Usage from datasets import load_dataset dataset = load_dataset("mstz/monks", "monks1")["train"]

“top 10 rows of Monks with summary statistics”
“counts grouped by the most common field in Monks”

Ionosphere The Ionosphere dataset from the UCI ML repository. Census dataset including personal characteristic of a person, and their ionosphere threshold. Configurations and tasks Configuration Task Description ionosphere Binary classification Does the received signal indicate electrons in the ionosphere? Usage from datasets import load_dataset dataset = load_dataset("mstz/ionosphere")["train"]

“scatter signal_0 vs signal_1”
“histogram of signal_0”

DebateSum Corresponding code repo for the upcoming paper at ARGMIN 2020: "DebateSum: A large-scale argument mining and summarization dataset" Arxiv pre-print available here: https://arxiv.org/abs/2011.07251 Check out the presentation date and time here: https://argmining2020.i3s.unice.fr/node/9 Full paper as presented by the ACL is here: https://www.aclweb.org/anthology/2020.argmining-1.1/ Video of presentation at COLING 2020:… See the full description on the dataset page: https://huggingface.co/datasets/Hellisotherpeople/DebateSum.

“line chart of Unnamed: 0 over Year”
“average Unnamed: 0 by OriginalDebateFileName”

Dataset Card for Dataset Name Homepage: https://hazyresearch.stanford.edu/legalbench/ Repository: https://github.com/HazyResearch/legalbench/ Paper: https://arxiv.org/abs/2308.11462 Dataset Description Dataset Summary The LegalBench project is an ongoing open science effort to collaboratively curate tasks for evaluating legal reasoning in English large language models (LLMs). The benchmark currently consists of 162 tasks gathered from 40… See the full description on the dataset page: https://huggingface.co/datasets/nguha/legalbench.

“histogram of index”
“most common values in answer”

We introduce Social IQa: Social Interaction QA, a new question-answering benchmark for testing social commonsense intelligence. Contrary to many prior benchmarks that focus on physical or taxonomic knowledge, Social IQa focuses on reasoning about people’s actions and their social implications. For example, given an action like "Jesse saw a concert" and a question like "Why did Jesse do this?", humans can easily infer that Jesse wanted "to see their favorite performer" or "to enjoy the music", and not "to see what's happening inside" or "to see if it works". The actions in Social IQa span a wide variety of social situations, and answer candidates contain both human-curated answers and adversarially-filtered machine-generated candidates. Social IQa contains over 37,000 QA pairs for evaluatin…

“top 10 rows of Social I Qa with summary statistics”
“counts grouped by the most common field in Social I Qa”

Dataset Card for [Dataset Name] Dataset Summary The SMS Spam Collection v.1 is a public set of SMS labeled messages that have been collected for mobile phone spam research. It has one collection composed by 5,574 English, real and non-enconded messages, tagged according being legitimate (ham) or spam. Supported Tasks and Leaderboards [More Information Needed] Languages English Dataset Structure Data Instances [More Information… See the full description on the dataset page: https://huggingface.co/datasets/ucirvine/sms_spam.

“histogram of label”
“most common values in sms”

This dataset gathers 728,321 biographies from wikipedia. It aims at evaluating text generation algorithms. For each article, we provide the first paragraph and the infobox (both tokenized). For each article, we extracted the first paragraph (text), the infobox (structured data). Each infobox is encoded as a list of (field name, field value) pairs. We used Stanford CoreNLP (http://stanfordnlp.github.io/CoreNLP/) to preprocess the data, i.e. we broke the text into sentences and tokenized both the text and the field values. The dataset was randomly split in three subsets train (80%), valid (10%), test (10%).

“top 10 rows of Wiki Bio with summary statistics”
“counts grouped by the most common field in Wiki Bio”

WikiHop is open-domain and based on Wikipedia articles; the goal is to recover Wikidata information by hopping through documents. The goal is to answer text understanding queries by combining multiple facts that are spread across different documents.

“top 10 rows of Wiki Hop with summary statistics”
“counts grouped by the most common field in Wiki Hop”

Dataset Card for "wiki_qa" Dataset Summary Wiki Question Answering corpus from Microsoft. The WikiQA corpus is a publicly available set of question and sentence pairs, collected and annotated for research on open-domain question answering. Supported Tasks and Leaderboards More Information Needed Languages More Information Needed Dataset Structure Data Instances default Size of downloaded dataset files: 7.10 MB Size… See the full description on the dataset page: https://huggingface.co/datasets/microsoft/wiki_qa.

“average label by question_id”
“top 10 question_id by total label”

A dataset containing 1585 papers with 5049 information-seeking questions asked by regular readers of NLP papers, and answered by a separate set of NLP practitioners.

“top 10 rows of Qasper with summary statistics”
“counts grouped by the most common field in Qasper”

Dataset Card for Narrative QA Dataset Summary NarrativeQA is an English-lanaguage dataset of stories and corresponding questions designed to test reading comprehension, especially on long documents. Supported Tasks and Leaderboards The dataset is used to test reading comprehension. There are 2 tasks proposed in the paper: "summaries only" and "stories only", depending on whether the human-generated summary or the full story text is used to answer the question.… See the full description on the dataset page: https://huggingface.co/datasets/deepmind/narrativeqa.

“top 10 rows of Narrativeqa with summary statistics”
“counts grouped by the most common field in Narrativeqa”

Explain Like I'm 5 long form QA dataset

“top 10 rows of Eli5 with summary statistics”
“counts grouped by the most common field in Eli5”

This corpus contains preprocessed posts from the Reddit dataset. The dataset consists of 3,848,330 posts with an average length of 270 words for content, and 28 words for the summary. Features includes strings: author, body, normalizedBody, content, summary, subreddit, subreddit_id. Content is used as document and summary is used as summary.

“top 10 rows of Reddit with summary statistics”
“counts grouped by the most common field in Reddit”

Dataset Card for "openwebtext" Dataset Summary An open-source replication of the WebText dataset from OpenAI, that was used to train GPT-2. This distribution was created by Aaron Gokaslan and Vanya Cohen of Brown University. Supported Tasks and Leaderboards More Information Needed Languages More Information Needed Dataset Structure Data Instances plain_text Size of downloaded dataset files: 13.51 GB Size of the… See the full description on the dataset page: https://huggingface.co/datasets/Skylion007/openwebtext.

“most common values in text”
“length distribution of text”

A colossal, cleaned version of Common Crawl's web crawl corpus. Based on Common Crawl dataset: "https://commoncrawl.org". This is the processed version of Google's C4 dataset by AllenAI.

“top 10 rows of C4 with summary statistics”
“counts grouped by the most common field in C4”

Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).

“top 10 rows of Wikipedia with summary statistics”
“counts grouped by the most common field in Wikipedia”

Dataset Card for "quora" Dataset Summary The Quora dataset is composed of question pairs, and the task is to determine if the questions are paraphrases of each other (have the same meaning). Supported Tasks and Leaderboards More Information Needed Languages More Information Needed Dataset Structure Data Instances default Size of downloaded dataset files: 58.17 MB Size of the generated dataset: 58.15 MB Total amount… See the full description on the dataset page: https://huggingface.co/datasets/quora-competitions/quora.

“top 10 rows of Quora with summary statistics”
“counts grouped by the most common field in Quora”

Dataset Card for STSb Multi MT Dataset Summary STS Benchmark comprises a selection of the English datasets used in the STS tasks organized in the context of SemEval between 2012 and 2017. The selection of datasets include text from image captions, news headlines and user forums. (source) These are different multilingual translations and the English original of the STSbenchmark dataset. Translation has been done with deepl.com. It can be used to train sentence embeddings… See the full description on the dataset page: https://huggingface.co/datasets/PhilipMay/stsb_multi_mt.

“top 10 rows of Stsb Multi Mt with summary statistics”
“counts grouped by the most common field in Stsb Multi Mt”

Dataset Card for OPUS Books Dataset Summary This is a collection of copyright free books aligned by Andras Farkas, which are available from http://www.farkastranslations.com/bilingual_books.php Note that the texts are rather dated due to copyright issues and that some of them are manually reviewed (check the meta-data at the top of the corpus files in XML). The source is multilingually aligned, which is available from http://www.farkastranslations.com/bilingual_books.php.… See the full description on the dataset page: https://huggingface.co/datasets/Helsinki-NLP/opus_books.

“top 10 rows of Opus Books with summary statistics”
“counts grouped by the most common field in Opus Books”

Dataset Card for OPUS-100 Dataset Summary OPUS-100 is an English-centric multilingual corpus covering 100 languages. OPUS-100 is English-centric, meaning that all training pairs include English on either the source or target side. The corpus covers 100 languages (including English). The languages were selected based on the volume of parallel data available in OPUS. Supported Tasks and Leaderboards Translation. Languages OPUS-100 contains… See the full description on the dataset page: https://huggingface.co/datasets/Helsinki-NLP/opus-100.

“top 10 rows of Opus100 with summary statistics”
“counts grouped by the most common field in Opus100”

The core of WIT3 is the TED Talks corpus, that basically redistributes the original content published by the TED Conference website (http://www.ted.com). Since 2007, the TED Conference, based in California, has been posting all video recordings of its talks together with subtitles in English and their translations in more than 80 languages. Aside from its cultural and social relevance, this content, which is published under the Creative Commons BYNC-ND license, also represents a precious language resource for the machine translation research community, thanks to its size, variety of topics, and covered languages. This effort repurposes the original content in a way which is more convenient for machine translation researchers.

“top 10 rows of Ted Talks Iwslt with summary statistics”
“counts grouped by the most common field in Ted Talks Iwslt”

This is a collection of translated sentences from Tatoeba 359 languages, 3,403 bitexts total number of files: 750 total number of tokens: 65.54M total number of sentence fragments: 8.96M

“top 10 rows of Tatoeba with summary statistics”
“counts grouped by the most common field in Tatoeba”

The key arguments for the low utilization of statistical techniques in financial sentiment analysis have been the difficulty of implementation for practical applications and the lack of high quality training data for building such models. Especially in the case of finance and economic texts, annotated collections are a scarce resource and many are reserved for proprietary use only. To resolve the missing training data problem, we present a collection of ∼ 5000 sentences to establish human-annotated standards for benchmarking alternative modeling techniques. The objective of the phrase level annotation task was to classify each example sentence into a positive, negative or neutral category by considering only the information explicitly available in the given sentence. Since the study is fo…

“top 10 rows of Financial Phrasebank with summary statistics”
“counts grouped by the most common field in Financial Phrasebank”

Dataset Description The Twitter Financial News dataset is an English-language dataset containing an annotated corpus of finance-related tweets. This dataset is used to classify finance-related tweets for their sentiment. The dataset holds 11,932 documents annotated with 3 labels: sentiments = { "LABEL_0": "Bearish", "LABEL_1": "Bullish", "LABEL_2": "Neutral" } The data was collected using the Twitter API. The current dataset supports the multi-class classification… See the full description on the dataset page: https://huggingface.co/datasets/zeroshot/twitter-financial-news-sentiment.

“histogram of label”
“most common values in text”

Dataset Description The Twitter Financial News dataset is an English-language dataset containing an annotated corpus of finance-related tweets. This dataset is used to classify finance-related tweets for their topic. The dataset holds 21,107 documents annotated with 20 labels: topics = { "LABEL_0": "Analyst Update", "LABEL_1": "Fed | Central Banks", "LABEL_2": "Company | Product News", "LABEL_3": "Treasuries | Corporate Debt", "LABEL_4": "Dividend"… See the full description on the dataset page: https://huggingface.co/datasets/zeroshot/twitter-financial-news-topic.

“histogram of label”
“most common values in text”

The dataset contains the annual report of US public firms filing with the SEC EDGAR system. Each annual report (10K filing) is broken into 20 sections. Each section is split into individual sentences. Sentiment labels are provided on a per filing basis from the market reaction around the filing data. Additional metadata for each filing is included in the dataset.

“top 10 rows of Financial Reports Sec with summary statistics”
“counts grouped by the most common field in Financial Reports Sec”

Content This is a dataset of Spotify tracks over a range of 125 different genres. Each track has some audio features associated with it. The data is in CSV format which is tabular and can be loaded quickly. Usage The dataset can be used for: Building a Recommendation System based on some user input or preference Classification purposes based on audio features and available genres Any other application that you can think of. Feel free to discuss! Column… See the full description on the dataset page: https://huggingface.co/datasets/maharshipandya/spotify-tracks-dataset.

“line chart of Unnamed: 0 over time_signature”
“average Unnamed: 0 by track_genre”

Dataset Card for Music Genre The Default dataset comprises approximately 1,700 musical pieces in .mp3 format, sourced from the NetEase music. The lengths of these pieces range from 270 to 300 seconds. All are sampled at the rate of 22,050 Hz. As the website providing the audio music includes style labels for the downloaded music, there are no specific annotators involved. Validation is achieved concurrently with the downloading process. They are categorized into a total of 16… See the full description on the dataset page: https://huggingface.co/datasets/ccmusic-database/music_genre.

“average fst_level_label by mel”
“top 10 mel by total fst_level_label”

Dataset Card for LibriTTS LibriTTS is a multi-speaker English corpus of approximately 585 hours of read English speech at 24kHz sampling rate, prepared by Heiga Zen with the assistance of Google Speech and Google Brain team members. The LibriTTS corpus is designed for TTS research. It is derived from the original materials (mp3 audio files from LibriVox and text files from Project Gutenberg) of the LibriSpeech corpus. Overview This is the LibriTTS dataset, adapted… See the full description on the dataset page: https://huggingface.co/datasets/mythicinfinity/libritts.

“top 10 rows of Libritts with summary statistics”
“counts grouped by the most common field in Libritts”

和谐历史档案馆数据集 - Banned Historical Archives Datasets 和谐历史档案馆数据集包含已录入 https://banned-historical-archives.github.io 和暂未未录入的原始文件。目录结构 banned-historical-archives.github.io # 已录入该网站的原始数据，不定期从 github 仓库中同步 raw # 原始文件 config # 配置文件 todo # 存放暂未录入网站的文件部分报纸和图片资料存放在单独的仓库: 名称地址状态参考消息 https://huggingface.co/datasets/banned-historical-archives/ckxx 未录入人民日报 https://huggingface.co/datasets/banned-historical-archives/rmrb 已精选重要的文章录入文汇报… See the full description on the dataset page: https://huggingface.co/datasets/banned-historical-archives/banned-historical-archives.

“top 10 rows of Banned Historical Archives with summary statistics”
“counts grouped by the most common field in Banned Historical Archives”

CADS: A Comprehensive Anatomical Dataset and Segmentation for Whole-Body Anatomy in Computed Tomography Overview CADS is a robust, fully automated framework for segmenting 167 anatomical structures in Computed Tomography (CT), spanning from head to knee regions across diverse anatomical systems. The framework consists of two main components: CADS-dataset: 22,022 CT volumes with complete annotations for 167 anatomical structures. Most extensive whole-body CT dataset… See the full description on the dataset page: https://huggingface.co/datasets/mrmrx/CADS-dataset.

“top 10 rows of Cads Dataset with summary statistics”
“counts grouped by the most common field in Cads Dataset”

PHYSICAL AI AUTONOMOUS VEHICLES The PhysicalAI-Autonomous-Vehicles dataset provides one of the largest, geographically diverse collections of multi-sensor data empowering AV researchers to build the next generation of Physical AI based end-to-end driving systems. This dataset is ready for commercial/non-commercial AV use per the license agreement. Data Collection Method Automatic/Sensor Labeling Method Automatic/Sensor This dataset has a total of 1700 hours of driving… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles.

“top 10 rows of Physicalai Autonomous Vehicles with summary statistics”
“counts grouped by the most common field in Physicalai Autonomous Vehicles”

OSWorld File Cache This repository serves as a file cache for the OSWorld project, providing reliable and fast access to evaluation files that were previously hosted on Google Drive. Overview OSWorld is a scalable, real computer environment for multimodal agents, supporting task setup, execution-based evaluation, and interactive learning across various operating systems and applications. This cache repository ensures that all evaluation files are consistently accessible… See the full description on the dataset page: https://huggingface.co/datasets/xlangai/ubuntu_osworld_file_cache.

“top 10 rows of Ubuntu Osworld File Cache with summary statistics”
“counts grouped by the most common field in Ubuntu Osworld File Cache”

Results on MTEB

“top 10 rows of Results with summary statistics”
“counts grouped by the most common field in Results”

Dataset Card for GSM8K Dataset Summary GSM8K (Grade School Math 8K) is a dataset of 8.5K high quality linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning. These problems take between 2 and 8 steps to solve. Solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − ×÷) to reach the… See the full description on the dataset page: https://huggingface.co/datasets/openai/gsm8k.

“top 10 rows of Gsm8K with summary statistics”
“counts grouped by the most common field in Gsm8K”

The ability to solve problems is a hallmark of intelligence and has been an enduring goal in AI. AI systems that can create programs as solutions to problems or assist developers in writing programs can increase productivity and make programming more accessible. Recently, pre-trained large language models have shown impressive abilities in generating new codes from natural language descriptions, repairing buggy codes, translating codes between languages, and retrieving relevant code segments. However, the evaluation of these models has often been performed in a scattered way on only one or two specific tasks, in a few languages, at a partial granularity (e.g., function) level and in many cases without proper training data. Even more concerning is that in most cases the evaluation of genera…

“top 10 rows of Xcodeeval with summary statistics”
“counts grouped by the most common field in Xcodeeval”

Dataset Summary SWE-bench Verified is a subset of 500 samples from the SWE-bench test set, which have been human-validated for quality. SWE-bench is a dataset that tests systems’ ability to solve GitHub issues automatically. See this post for more details on the human-validation process. The dataset collects 500 test Issue-Pull Request pairs from popular Python repositories. Evaluation is performed by unit test verification using post-PR behavior as the reference solution. The original… See the full description on the dataset page: https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified.

“top 10 rows of Swe Bench Verified with summary statistics”
“counts grouped by the most common field in Swe Bench Verified”

🍷 FineWeb 15 trillion tokens of the finest data the 🌐 web has to offer What is it? The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated english web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.

“line chart of language_score over date”
“average language_score by dump”

Dataset Card for "medical-qa-shared-task-v1-toy" More Information needed

“histogram of label”
“most common values in ending0”

[!NOTE] We have released a paper for OpenThoughts! See our paper here. Open-Thoughts-1k-sample This is a 1k sample of the OpenThoughts-114k dataset. Open synthetic reasoning dataset with high-quality examples covering math, science, code, and puzzles! Inspect the content with rich formatting with Curator Viewer. Available Subsets default subset containing ready-to-train data used to finetune the OpenThinker-7B and OpenThinker-32B models: ds =… See the full description on the dataset page: https://huggingface.co/datasets/ryanmarten/OpenThoughts-1k-sample.

“count of records by system”
“distribution of system”

test3

“top 10 rows of Debug with summary statistics”
“counts grouped by the most common field in Debug”

Dataset Card for MMLU Dataset Summary Measuring Massive Multitask Language Understanding by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt (ICLR 2021). This is a massive multitask test consisting of multiple-choice questions from various branches of knowledge. The test spans subjects in the humanities, social sciences, hard sciences, and other areas that are important for some people to learn. This covers 57 tasks… See the full description on the dataset page: https://huggingface.co/datasets/cais/mmlu.

“top 10 rows of Mmlu with summary statistics”
“counts grouped by the most common field in Mmlu”

Hugging Face dataset: Yarina/Meta_Kaggle_Dataset_Archive_2026-03-12

“scatter CompetitionId vs TagId”
“histogram of CompetitionId”

Dataset Card for GLUE Dataset Summary GLUE, the General Language Understanding Evaluation benchmark (https://gluebenchmark.com/) is a collection of resources for training, evaluating, and analyzing natural language understanding systems. Supported Tasks and Leaderboards The leaderboard for the GLUE benchmark can be found at this address. It comprises the following tasks: ax A manually-curated evaluation dataset for fine-grained analysis of system… See the full description on the dataset page: https://huggingface.co/datasets/nyu-mll/glue.

“top 10 rows of Glue with summary statistics”
“counts grouped by the most common field in Glue”

CommitPackFT is is a 2GB filtered version of CommitPack to contain only high-quality commit messages that resemble natural language instructions.

“top 10 rows of Commitpackft with summary statistics”
“counts grouped by the most common field in Commitpackft”

Dataset Card for "ai2_arc" Dataset Summary A new dataset of 7,787 genuine grade-school level, multiple-choice science questions, assembled to encourage research in advanced question-answering. The dataset is partitioned into a Challenge Set and an Easy Set, where the former contains only questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. We are also including a corpus of over 14 million science sentences relevant to… See the full description on the dataset page: https://huggingface.co/datasets/allenai/ai2_arc.

“top 10 rows of Ai2 Arc with summary statistics”
“counts grouped by the most common field in Ai2 Arc”

ニコニコ実況過去ログアーカイブニコニコ実況過去ログアーカイブは、ニコニコ実況のサービス開始から現在までのすべての過去ログコメントを収集したデータセットです。去る2020年12月、ニコニコ実況はニコニコ生放送内の一公式チャンネルとしてリニューアルされました。これに伴い、2009年11月から運用されてきた旧システムは提供終了となり（事実上のサービス終了）、torne や BRAVIA などの家電への対応が軒並み終了する中、当時の生の声が詰まった約11年分の過去ログも同時に失われることとなってしまいました。そこで 5ch の DTV 板の住民が中心となり、旧ニコニコ実況が終了するまでに11年分の全チャンネルの過去ログをアーカイブする計画が立ち上がりました。紆余曲折あり Nekopanda 氏が約11年分のラジオや BS も含めた全チャンネルの過去ログを完璧に取得してくださったおかげで、11年分の過去ログが電子の海に消えていく事態は回避できました。しかし、旧 API が廃止されてしまったため過去ログを API… See the full description on the dataset page: https://huggingface.co/datasets/KakologArchives/KakologArchives.

“top 10 rows of Kakologarchives with summary statistics”
“counts grouped by the most common field in Kakologarchives”

Hugging Face dataset: world-igr-plum/regions

“top 10 rows of Regions with summary statistics”
“counts grouped by the most common field in Regions”

Dataset Summary SWE-Bench Pro is a challenging, enterprise-level dataset for testing agent ability on long-horizon software engineering tasks. Paper: https://static.scale.com/uploads/654197dc94d34f66c0f5184e/SWEAP_Eval_Scale%20(9).pdf See the related evaluation Github: https://github.com/scaleapi/SWE-bench_Pro-os Dataset Structure We follow SWE-Bench Verified (https://huggingface.co/datasets/SWE-bench/SWE-bench_Verified) in terms of dataset structure, with several… See the full description on the dataset page: https://huggingface.co/datasets/ScaleAI/SWE-bench_Pro.

“top 10 rows of Swe Bench Pro with summary statistics”
“counts grouped by the most common field in Swe Bench Pro”

Dataset Card for "hellaswag" Dataset Summary HellaSwag: Can a Machine Really Finish Your Sentence? is a new dataset for commonsense NLI. A paper was published at ACL2019. Supported Tasks and Leaderboards More Information Needed Languages More Information Needed Dataset Structure Data Instances default Size of downloaded dataset files: 71.49 MB Size of the generated dataset: 65.32 MB Total amount of disk used: 136.81… See the full description on the dataset page: https://huggingface.co/datasets/Rowan/hellaswag.

“average ind by activity_label”
“top 10 activity_label by total ind”

This dataset was created using LeRobot. Dataset Structure meta/info.json: { "codebase_version": "v2.1", "robot_type": "Franka", "total_episodes": 95600, "total_frames": 27612581, "total_tasks": 0, "total_videos": 286800, "total_chunks": 95, "chunks_size": 1000, "fps": 15, "splits": { "train": "0:95600" }, "data_path": "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet", "video_path":… See the full description on the dataset page: https://huggingface.co/datasets/cadene/droid_1.0.1.

“line chart of observation.state.gripper_position over date”
“average observation.state.gripper_position by language_instruction”

Hugging Face dataset: AquaV/genshin-voices-separated

“count of records by language”
“distribution of language”

TxT360: A Top-Quality LLM Pre-training Dataset Requires the Perfect Blend Changelog Version Details v1.1 Added new data sources: TxT360_BestOfWeb, TxT360_QA, europarl-aligned, and wikipedia_extended. Details of v1.1 Additions TxT360_BestOfWeb: This is a filtered version of the TxT360 dataset, created using the ProX document filtering model. The model is similar to the FineWeb-Edu classifier, but also assigns an additional format score that… See the full description on the dataset page: https://huggingface.co/datasets/LLM360/TxT360.

“count of records by subset”
“distribution of subset”

This dataset was created using LeRobot. DROID: A Large-Scale In-the-Wild Robot Manipulation Dataset One of the biggest open-source dataset for robotics with 27.044,326 frames, 92,223 episodes, 31,308 unique task description in natural language. Ported from Tensorflow Dataset format (2TB) to LeRobotDataset format (400GB) with the help from IPEC-COMMUNITY. Visualization: LeRobot Homepage: Droid Paper: Arxiv License: apache-2.0 Dataset Structure meta/info.json: {… See the full description on the dataset page: https://huggingface.co/datasets/cadene/droid.

“top 10 rows of Droid with summary statistics”
“counts grouped by the most common field in Droid”

GIFT-Eval Pre-training Datasets Pretraining dataset aligned with GIFT-Eval that has 71 univariate and 17 multivariate datasets, spanning seven domains and 13 frequencies, totaling 4.5 million time series and 230 billion data points. Notably this collection of data has no leakage issue with the train/test split and can be used to pretrain foundation models that can be fairly evaluated on GIFT-Eval. 📄 Paper 🖥️ Code 📔 Blog Post 🏎️ Leader Board Ethical Considerations… See the full description on the dataset page: https://huggingface.co/datasets/Salesforce/GiftEvalPretrain.

“top 10 rows of Giftevalpretrain with summary statistics”
“counts grouped by the most common field in Giftevalpretrain”

LLaVA-OneVision-1.5 Instruction Data Paper | Code 📌 Introduction This dataset, LLaVA-OneVision-1.5-Instruct, was collected and integrated during the development of LLaVA-OneVision-1.5. LLaVA-OneVision-1.5 is a novel family of Large Multimodal Models (LMMs) that achieve state-of-the-art performance with significantly reduced computational and financial costs. This meticulously curated 22M instruction dataset (LLaVA-OneVision-1.5-Instruct) is part of a comprehensive and… See the full description on the dataset page: https://huggingface.co/datasets/mvp-lab/LLaVA-OneVision-1.5-Instruct-Data.

“top 10 rows of Llava Onevision 1.5 Instruct Data with summary statistics”
“counts grouped by the most common field in Llava Onevision 1.5 Instruct Data”

Dataset Card for "super_glue" Dataset Summary SuperGLUE (https://super.gluebenchmark.com/) is a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, improved resources, and a new public leaderboard. Supported Tasks and Leaderboards More Information Needed Languages More Information Needed Dataset Structure Data Instances axb Size of downloaded dataset files: 0.03 MB Size of… See the full description on the dataset page: https://huggingface.co/datasets/aps/super_glue.

“top 10 rows of Super Glue with summary statistics”
“counts grouped by the most common field in Super Glue”

About this dataset Context The datasets provided include the players data for the Career Mode from FIFA 15 to FIFA 23. The data allows multiple comparisons for the same players across the last 9 versions of the video game. Some ideas of possible analysis: Historical comparison between Messi and Ronaldo (what skill attributes changed the most during time - compared to real-life stats); Ideal budget to create a competitive team (at the level of top n teams in Europe) and… See the full description on the dataset page: https://huggingface.co/datasets/jsulz/FIFA23.

“scatter coach_id vs nationality_id”
“histogram of coach_id”

Dataset Card for "winogrande" Dataset Summary WinoGrande is a new collection of 44k problems, inspired by Winograd Schema Challenge (Levesque, Davis, and Morgenstern 2011), but adjusted to improve the scale and robustness against the dataset-specific bias. Formulated as a fill-in-a-blank task with binary options, the goal is to choose the right option for a given sentence which requires commonsense reasoning. Supported Tasks and Leaderboards More Information… See the full description on the dataset page: https://huggingface.co/datasets/allenai/winogrande.

“top 10 rows of Winogrande with summary statistics”
“counts grouped by the most common field in Winogrande”

Dataset Card for "imdb" Dataset Summary Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Supported Tasks and Leaderboards More Information Needed Languages More Information Needed Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/imdb.

“histogram of label”
“most common values in text”

Common Corpus Full paper - ICLR 2026 oral Common Corpus is the largest open and permissible licensed text dataset, comprising 2.27 trillion tokens (2,267,302,720,836 tokens). It is a diverse dataset, consisting of books, newspapers, scientific articles, government and legal documents, code, and more. Common Corpus has been created by Pleias in association with several partners. Common Corpus differs from existing open datasets in that it is: Truly Open: contains only data that… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/common_corpus.

“top 10 rows of Common Corpus with summary statistics”
“counts grouped by the most common field in Common Corpus”

🚀 LLaVA-One-Vision-1.5-Mid-Training-85M Dataset is being uploaded 🚀 Upload Status All Completed: ImageNet-21k、LAIONCN、DataComp-1B、Zero250M、COYO700M、SA-1B、MINT、Obelics 📜 Cite If you find LLaVA-One-Vision-1.5-Mid-Training-85M useful in your research, please consider to cite the following related papers: @misc{an2025llavaonevision15fullyopenframework, title={LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training}… See the full description on the dataset page: https://huggingface.co/datasets/mvp-lab/LLaVA-OneVision-1.5-Mid-Training-85M.

“top 10 rows of Llava Onevision 1.5 Mid Training 85M with summary statistics”
“counts grouped by the most common field in Llava Onevision 1.5 Mid Training 85M”

GroMo25: Multiview Time-Series Plant Image Dataset for Age Estimation and Leaf Counting Dataset Summary GroMo25 is a multiview, time-series plant image dataset designed for plant age estimation (in days) and leaf counting tasks in precision agriculture. It contains high-quality images of four crop species — Wheat, Okra, Radish, and Mustard — captured over multiple days under controlled conditions. Each plant is photographed from 24 angles across 5 vertical levels per day… See the full description on the dataset page: https://huggingface.co/datasets/MrigLabIITRopar/GroMo25.

“scatter leaf_count vs Age”
“histogram of leaf_count”

Dataset Card for Mostly Basic Python Problems (mbpp) Dataset Summary The benchmark consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, code solution and 3 automated test cases. As described in the paper, a subset of the data has been hand-verified by us. Released here as part of… See the full description on the dataset page: https://huggingface.co/datasets/google-research-datasets/mbpp.

“average task_id by test_setup_code”
“top 10 test_setup_code by total task_id”

Curated ProteinMPNN training dataset The multi-chain training data for ProteinMPNN Quickstart Usage Install HuggingFace Datasets package Each subset can be loaded into python using the Huggingface datasets library. First, from the command line install the datasets library $ pip install datasets Optionally set the cache directory, e.g. $ HF_HOME=${HOME}/.cache/huggingface/ $ export HF_HOME then, from within python load the datasets library >>> import datasets… See the full description on the dataset page: https://huggingface.co/datasets/RosettaCommons/ProteinMPNN.

“top 10 rows of Proteinmpnn with summary statistics”
“counts grouped by the most common field in Proteinmpnn”

[!NOTE] We have released a paper for OpenThoughts! See our paper here. Open-Thoughts-114k Open synthetic reasoning dataset with 114k high-quality examples covering math, science, code, and puzzles! Inspect the content with rich formatting with Curator Viewer. Available Subsets default subset containing ready-to-train data used to finetune the OpenThinker-7B and OpenThinker-32B models: ds = load_dataset("open-thoughts/OpenThoughts-114k", split="train")… See the full description on the dataset page: https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k.

“count of records by system”
“distribution of system”

KAI0 TODO The advantage label will be coming soon. Contents About the Dataset Load the Dataset Download the Dataset Dataset Structure Folder hierarchy Details License and Citation About the Dataset ~134 hours real world scenarios Main Tasks Task_A Single task Initial state: T-shirts are randomly tossed onto the table, presenting random crumpled configurations Manipulation task: Operate the… See the full description on the dataset page: https://huggingface.co/datasets/balatubs123/kumagong.

“line chart of progress_gt over timestamp”
“scatter progress_gt vs stage_progress_gt”

Hugging Face dataset: baber/piqa

“histogram of label”
“most common values in goal”

GroundCUA: Grounding Computer Use Agents on Human Demonstrations 🌐 Website | 📑 Paper | 🤗 Dataset | 🤖 Models GroundCUA Dataset GroundCUA is a large and diverse dataset of real UI screenshots paired with structured annotations for building multimodal computer use agents. It covers 87 software platforms across productivity tools, browsers, creative tools, communication apps, development environments, and system utilities. GroundCUA is designed for research on GUI… See the full description on the dataset page: https://huggingface.co/datasets/ServiceNow/GroundCUA.

“most common values in image”
“length distribution of image”

Further cleaning done. Please look through the dataset and ensure that I didn't miss anything. Update: Confirmed working method for training the model: https://huggingface.co/AlekseyKorshuk/vicuna-7b/discussions/4#64346c08ef6d5abefe42c12c Two choices: Removes instances of "I'm sorry, but": https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_V3_unfiltered_cleaned_split_no_imsorry.json Has instances of "I'm sorry, but":… See the full description on the dataset page: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered.

“top 10 rows of Sharegpt Vicuna Unfiltered with summary statistics”
“counts grouped by the most common field in Sharegpt Vicuna Unfiltered”

Hugging Face dataset: genarenadata/backup-leaderboard-data

“top 10 rows of Backup Leaderboard Data with summary statistics”
“counts grouped by the most common field in Backup Leaderboard Data”

arXiv Papers by Subject A reorganised version of the nick007x/arxiv-papers dataset, partitioned by subject code, year, and month for efficient selective access. Dataset Description This dataset contains metadata for over 2.5 million arXiv papers, organised into a hierarchical directory structure that allows users to download only the specific subjects and time periods they need, rather than the entire dataset. Motivation The original nick007x/arxiv-papers… See the full description on the dataset page: https://huggingface.co/datasets/permutans/arxiv-papers-by-subject.

“count of records over submission_date”
“count of records by primary_subject”

SWE-Gym contains 2438 instances sourced from 11 Python repos, following SWE-Bench data collection procedure. Get started at project page github.com/SWE-Gym/SWE-Gym

“count of records by repo”
“distribution of repo”

Hugging Face dataset: shi-labs/oneformer_demo

“top 10 rows of Oneformer Demo with summary statistics”
“counts grouped by the most common field in Oneformer Demo”

Dataset Card for Dataset Name This dataset card aims to be a base template for new datasets. It has been generated using this raw template. Dataset Details Dataset Description Curated by: [More Information Needed] Funded by [optional]: [More Information Needed] Shared by [optional]: [More Information Needed] Language(s) (NLP): [More Information Needed] License: [More Information Needed] Dataset Sources [optional]… See the full description on the dataset page: https://huggingface.co/datasets/bluuebunny/arxiv_metadata_by_year.

“top 10 rows of Arxiv Metadata By Year with summary statistics”
“counts grouped by the most common field in Arxiv Metadata By Year”

Hugging Face dataset: ksolovev/FineNews

“most common values in text”
“length distribution of text”

Hugging Face dataset: nick007x/arxiv-papers

“count of records over submission_date”
“count of records by primary_subject”

Dataset Card for "jat-dataset-tokenized" More Information needed

“top 10 rows of Jat Dataset Tokenized with summary statistics”
“counts grouped by the most common field in Jat Dataset Tokenized”

Hugging Face dataset: SWE-bench/SWE-bench_Multilingual

“top 10 rows of Swe Bench Multilingual with summary statistics”
“counts grouped by the most common field in Swe Bench Multilingual”

Hugging Face dataset: mstz/diva

“top 10 rows of Diva with summary statistics”
“counts grouped by the most common field in Diva”

Compas The Compas dataset for recidivism prediction. Dataset known to have racial bias issues, check this Propublica article on the topic. Configurations and tasks Configuration Task Description encoding Encoding dictionary showing original values of encoded features. two-years-recidividity Binary classification Will the defendant be a violent recidivist? two-years-recidividity-no-race Binary classification As above, but the race feature is… See the full description on the dataset page: https://huggingface.co/datasets/mstz/compas.

“line chart of age over days_before_screening_arrest”
“average age by race”

Bank The Bank dataset from the UCI ML repository. Potential clients are contacted by a bank during a second advertisement campaign. This datasets records the customer, the interaction with the AD campaign, and if they subscribed to a proposed bank plan or not. Configurations and tasks Configuration Task Description encoding Encoding dictionary showing original values of encoded features. subscription Binary classification Has the customer subscribed… See the full description on the dataset page: https://huggingface.co/datasets/mstz/bank.

“line chart of age over month_of_last_contact”
“average age by marital_status”

Breast cancer The Breast cancer dataset from the UCI ML repository. Classify cancerousness of the given cell. Configurations and tasks Configuration Task Description cancer Binary classification Is the cell clump cancerous? Usage from datasets import load_dataset dataset = load_dataset("mstz/breast", "cancer")["train"] Features Name Type Description clump_thickness int8 Thickness of the clump… See the full description on the dataset page: https://huggingface.co/datasets/mstz/breast.

“scatter clump_thickness vs uniformity_of_cell_size”
“histogram of clump_thickness”

German The German dataset from the UCI ML repository. Dataset on loan grants to customers. Configurations and tasks Configuration Task Description encoding Encoding dictionary showing original values of encoded features. loan Binary classification Has the loan request been accepted? Usage from datasets import load_dataset dataset = load_dataset("mstz/german", "loan")["train"] Features Feature Type… See the full description on the dataset page: https://huggingface.co/datasets/mstz/german.

“line chart of checking_account_status over account_life_in_months”
“average checking_account_status by loan_purpose”

HELOC The HELOC dataset from FICO. Each entry in the dataset is a line of credit, typically offered by a bank as a percentage of home equity (the difference between the current market value of a home and its purchase price). The customers in this dataset have requested a credit line in the range of $5,000 - $150,000. The fundamental task is to use the information about the applicant in their credit report to predict whether they will repay their HELOC account within 2 years.… See the full description on the dataset page: https://huggingface.co/datasets/mstz/heloc.

“line chart of estimate_of_risk over months_since_first_trade”
“scatter estimate_of_risk vs average_duration_of_resolution”

Speed dating The Speed dating dataset from OpenML. Configurations and tasks Configuration Task Description dating Binary classification Will the two date? Usage from datasets import load_dataset dataset = load_dataset("mstz/speeddating")["train"] Features Features Type is_dater_male int8 dater_age int8 dated_age int8 age_difference int8 dater_race string dated_race string are_same_race int8… See the full description on the dataset page: https://huggingface.co/datasets/mstz/speeddating.

“line chart of age_difference over is_dater_male”
“average age_difference by are_same_race”

Diamonds The Diamonds dataset from Kaggle. Dataset collecting properties of cut diamonds to determine the cut quality. Configurations and tasks Configuration Task Description encoding Encoding dictionary showing original values of encoded features. cut Multiclass classification Predict the cut quality of the diamond. cut_binary Binary classification Is the cut quality at least very good? Usage from datasets import load_dataset… See the full description on the dataset page: https://huggingface.co/datasets/mstz/diamonds.

“top 10 rows of Diamonds with summary statistics”
“counts grouped by the most common field in Diamonds”

Student performance The Student performance dataset from Kaggle. Configuration Task Description encoding Encoding dictionary showing original values of encoded features. math Binary classification Has the student passed the math exam? writing Binary classification Has the student passed the writing exam? reading Binary classification Has the student passed the reading exam? Usage from datasets import load_dataset dataset =… See the full description on the dataset page: https://huggingface.co/datasets/mstz/student_performance.

“average parental_level_of_education by ethnicity”
“top 10 ethnicity by total parental_level_of_education”

Heart failure The Heart failure dataset from Kaggle. Predict patient death from earth failure given some personal medical data . Configurations and tasks Configuration Task Description death Binary classification Did the patient die? Usage from datasets import load_dataset dataset = load_dataset("mstz/heart_failure", "death")["train"] Features Feature Type age int8 has_anaemia int8… See the full description on the dataset page: https://huggingface.co/datasets/mstz/heart_failure.

“line chart of age over days_in_study”
“average age by has_anaemia”

NBFI The NBFI dataset from the Kaggle. Client default prediction. Configuration Task Description default Binary classification Has the client defaulted? Usage from datasets import load_dataset dataset = load_dataset("mstz/nbfi")["train"] Features Feature Type income float32 owns_a_car bool owns_a_bike bool has_an_active_loan bool owns_a_house bool nr_children int8 credit float32 loan_annuity float32… See the full description on the dataset page: https://huggingface.co/datasets/mstz/nbfi.

“line chart of income over age_in_days”
“average income by accompanied_by”

DO NOT USE Still working on it. Annealing The Annealing dataset from the UCI ML repository. Dataset Configurations and tasks Configuration Task Description annealing Multiclass classification Usage from datasets import load_dataset dataset = load_dataset("mstz/annealing")["train"]

“top 10 rows of Annealing with summary statistics”
“counts grouped by the most common field in Annealing”

Adult The Toxicity dataset from the UCI ML repository. The dataset includes 171 molecules designed for functional domains of a core clock protein, CRY1, responsible for generating circadian rhythm. Configurations and tasks Configuration Task Description toxicity Binary classification Is the molecule toxic? Usage from datasets import load_dataset dataset = load_dataset("mstz/toxicity")["train"]

“scatter MATS3v vs nHBint10”
“histogram of MATS3v”

Isolet The Isolet dataset from the UCI ML repository. Configurations and tasks Configuration Task Description isolet Multiclass classification What letter was uttered? Usage from datasets import load_dataset dataset = load_dataset("mstz/isolet", "isolet")["train"]

“top 10 rows of Isolet with summary statistics”
“counts grouped by the most common field in Isolet”

Acute Inflammation The Acute Inflammation dataset from the UCI ML repository. Predict whether the patient has an acute inflammation. Configurations and tasks Configuration Task Description inflammation Binary classification Does the patient have an acute inflammation? nephritis Binary classification Does the patient have a nephritic pelvis? bladder Binary classification Does the patient have bladder inflammation? nephritis Usage… See the full description on the dataset page: https://huggingface.co/datasets/mstz/acute_inflammation.

“average temperature by has_nausea”
“top 10 has_nausea by total temperature”

Arhythmia The Arrhythmia dataset from the UCI ML repository. Does the patient have arhythmia? If so, what type? Configurations and tasks Configuration Task Description arhytmia Multiclass classification What type of arhythmia does the patient have? has_arhytmia Binary classification Does the patient have arhythmia? Usage from datasets import load_dataset dataset = load_dataset("mstz/arhythmia", "arhythmia")["train"]… See the full description on the dataset page: https://huggingface.co/datasets/mstz/arhythmia.

“average age by is_female”
“top 10 is_female by total age”

Balance scale The Balance scale dataset from the UCI ML repository. Two weights are put on the arms of a scale. Where does the scale tilt? Configurations and tasks Configuration Task Description balance Multiclass classification Where does the scale tilt? is_balanced Binary classification Does the scale tilt? Usage from datasets import load_dataset dataset = load_dataset("mstz/balance_scale", "balance")["train"]… See the full description on the dataset page: https://huggingface.co/datasets/mstz/balance_scale.

“top 10 rows of Balance Scale with summary statistics”
“counts grouped by the most common field in Balance Scale”

Balloons The Balloons dataset from the UCI ML repository. Predict if the given balloon is inflated. Configurations and tasks Configuration Task Description adult_or_stretch Binary classification Balloons are inflated if age == adult or act == stretch. adult_and_stretch Binary classification Balloons are inflated if age == adult and act == stretch. yellow_and_small Binary classification Balloons are inflated if color == yellow and size == small.… See the full description on the dataset page: https://huggingface.co/datasets/mstz/balloons.

“top 10 rows of Balloons with summary statistics”
“counts grouped by the most common field in Balloons”

Blood The Blood Transfusion dataset from the UCI ML repository. Census dataset including personal characteristic of a person, and their income threshold. Configurations and tasks Configuration Task Description blood Binary classification Has the person donated blood in the past month? Usage from datasets import load_dataset dataset = load_dataset("mstz/blood")["train"]

“line chart of total_donation over months_since_last_donation”
“scatter total_donation vs total_blood_donated_in_cc”

Chess Rock VS Pawn The Chess Rock VS Pawn dataset from the UCI ML repository. Configurations and tasks Configuration Task Description chess Binary classification Can the white piece win? Usage from datasets import load_dataset dataset = load_dataset("mstz/chess_rock_vs_pawn")["train"]

“top 10 rows of Chess Rock Vs Pawn with summary statistics”
“counts grouped by the most common field in Chess Rock Vs Pawn”

Congress The Congress dataset from the UCI ML repository. Congressmen of two different parties vote on a series of bills. Guess the party of each voter on the basis of their votes. Configurations and tasks Configuration Task Description voting Binary classification What's the party of the voter? Usage from datasets import load_dataset dataset = load_dataset("mstz/congress", "voting")["train"]

“top 10 rows of Congress with summary statistics”
“counts grouped by the most common field in Congress”

Fertility The Fertility dataset from the UCI ML repository. Classify fertility abnormalities of patients. Configurations and tasks Configuration Task Description encoding Encoding dictionary fertility Binary classification Does the patient have fertility issues? Usage from datasets import load_dataset dataset = load_dataset("mstz/fertility", "fertility")["train"] Features Feature Type season_of_sampling… See the full description on the dataset page: https://huggingface.co/datasets/mstz/fertility.

“line chart of frequency_of_alcohol_consumption over age_at_time_of_sampling”
“average frequency_of_alcohol_consumption by season_of_sampling”

Haberman The Haberman dataset from the UCI ML repository. Has the patient survived surgery? Configurations and tasks Configuration Task Description sruvival Binary classification Has the patient survived surgery? Usage from datasets import load_dataset dataset = load_dataset("mstz/haberman", "survival")["train"]

“line chart of age over year_of_operation”
“scatter age vs number_of_axillary_nodes”

ILPD The ILPD dataset from the UCI ML repository. Configurations and tasks Configuration Task Description liver Binary classification Does the patient have liver problems? Usage from datasets import load_dataset dataset = load_dataset("mstz/liver")["train"]

“top 10 rows of Liver with summary statistics”
“counts grouped by the most common field in Liver”

Mammography The Mammography dataset from the UCI ML repository. Configurations and tasks Configuration Task Description mammography Binary classification Is the lesion benign? Usage from datasets import load_dataset dataset = load_dataset("mstz/mammography")["train"]

“top 10 rows of Mammography with summary statistics”
“counts grouped by the most common field in Mammography”

Promoters The Promoters dataset from the UCI ML repository. Configurations and tasks Configuration Task Description promoters Binary classification Is this DNA string a promoter? Usage from datasets import load_dataset dataset = load_dataset("mstz/promoters")["train"]

“top 10 rows of Promoters with summary statistics”
“counts grouped by the most common field in Promoters”

Musk The Musk dataset from the UCI ML repository. Census dataset including personal characteristic of a person, and their income threshold. Configurations and tasks Configuration Task Description musk Binary classification Is the molecule a musk? Usage from datasets import load_dataset dataset = load_dataset("mstz/musk", "musk")["train"]

“scatter ray_0 vs ray_1”
“histogram of ray_0”

Musk The Musk dataset from the UCI ML repository. Census dataset including personal characteristic of a person, and their income threshold. Configurations and tasks Configuration Task Description musk Binary classification Is the molecule a musk? Usage from datasets import load_dataset dataset = load_dataset("mstz/muskV2")["train"]

“top 10 rows of Muskv2 with summary statistics”
“counts grouped by the most common field in Muskv2”

Ozone The Ozone dataset from the UCI ML repository. Configurations and tasks Configuration Task Description 8hr Binary classification Is there an ozone layer? 1hr Binary classification Is there an ozone layer? Usage from datasets import load_dataset dataset = load_dataset("mstz/ozone", "8hr")["train"]

“scatter WSR0 vs WSR1”
“histogram of WSR0”

pima The pima dataset from the UCI ML repository. Predict diabetes of a patient. Configurations and tasks Configuration Task Description pima Binary classification Does the patient have diabetes? Usage from datasets import load_dataset dataset = load_dataset("mstz/pima")["train"]

“scatter number_of_pregnancies vs plasma_glucose_concentration”
“histogram of number_of_pregnancies”

Planning The Planning dataset from the UCI ML repository. Configurations and tasks Configuration Task Description planning Binary classification Is the patient in a planning state? Usage from datasets import load_dataset dataset = load_dataset("mstz/planning")["train"]

“scatter V1 vs V2”
“histogram of V1”

Ozone The Ozone dataset from the UCI ML repository. Configurations and tasks Configuration Task Description spect Binary classification Is there an ozone layer? spectf Binary classification Is there an ozone layer? Usage from datasets import load_dataset dataset = load_dataset("mstz/spect", "spect")["train"]

“average is_emitted by feature_0”
“top 10 feature_0 by total is_emitted”

Australian Credit The Australian Credit from the UCI ML repository. Classification of loan approval. Configurations and tasks Configuration Task Description australian_credit Binary classification Is the loan granted? Usage from datasets import load_dataset dataset = load_dataset("mstz/australian_credit")["train"] Features Target feature changes according to the selected configuration and is always in last position… See the full description on the dataset page: https://huggingface.co/datasets/mstz/australian_credit.

“average feature_2 by feature_1”
“top 10 feature_1 by total feature_2”

TicTacToe The TicTacToe dataset from the UCI ML repository. Configurations and tasks Configuration Task Description tic_tac_toe Binary classification Does the X player win? Usage from datasets import load_dataset dataset = load_dataset("mstz/tic_tac_toe")["train"]

“top 10 rows of Tic Tac Toe with summary statistics”
“counts grouped by the most common field in Tic Tac Toe”

TwoNorm The TwoNorm dataset from the OpenML repository. Configurations and tasks Configuration Task twonorm Binary classification Usage from datasets import load_dataset dataset = load_dataset("mstz/twonorm")["train"]

“scatter V1 vs V2”
“histogram of V1”

Vertebral Column The Vertebral Column dataset from the UCI ML repository. Configurations and tasks Configuration Task Description abnormal Binary classification Is the spine abnormal? Usage from datasets import load_dataset dataset = load_dataset("mstz/vertebral_column")["train"]

“scatter pelvic_incidence vs pelvic_tilt”
“histogram of pelvic_incidence”

Electricity The Electricity dataset from the OpenML repository. Configurations and tasks Configuration Task Description electricity Binary classification Has the electricity cost gone up? Usage from datasets import load_dataset dataset = load_dataset("mstz/electricity", "electricity")["train"]

“line chart of nswprice over date”
“scatter period vs nswprice”

Pol The Pol dataset from the OpenML repository. Configurations and tasks Configuration Task Description pol Binary classification Has the pol cost gone up? Usage from datasets import load_dataset dataset = load_dataset("mstz/pol", "pol")["train"]

“scatter f1 vs f2”
“histogram of f1”

House16 The House16 dataset from the OpenML repository. Configurations and tasks Configuration Task house16 Binary classification Usage from datasets import load_dataset dataset = load_dataset("mstz/house16", "house16")["train"]

“scatter P1 vs P5p1”
“histogram of P1”

Phoneme The Phoneme dataset from the OpenML repository. Configurations and tasks Configuration Task phoneme Binary classification Usage from datasets import load_dataset dataset = load_dataset("mstz/phoneme")["train"]

“scatter V1 vs V2”
“histogram of V1”

Contraceptive The Contraceptive dataset from the UCI repository. Does the couple use contraceptives? Configurations and tasks Configuration Task Description contraceptive Binary classification Does the couple use contraceptives? Usage from datasets import load_dataset dataset = load_dataset("mstz/contraceptive", "contraceptive")["train"]

“average age_of_wife by is_wife_muslim”
“top 10 is_wife_muslim by total age_of_wife”

Hayes The Hayes-Roth dataset from the UCI repository. Configurations and tasks Configuration Task Description hayes Multiclass classification Classify hayes type. hayes_1 Binary classification Is this instance of class 1? hayes_2 Binary classification Is this instance of class 2? hayes_3 Binary classification Is this instance of class 3? Usage from datasets import load_dataset dataset = load_dataset("mstz/hayes"… See the full description on the dataset page: https://huggingface.co/datasets/mstz/hayes_roth.

“top 10 rows of Hayes Roth with summary statistics”
“counts grouped by the most common field in Hayes Roth”

Iris The Iris dataset from the UCI repository. Configurations and tasks Configuration Task Description iris Multiclass classification Classify iris type. setosa Binary classification Is this a iris-setosa? versicolor Binary classification Is this a iris-versicolor? virginica Binary classification Is this a iris-virginica? Usage from datasets import load_dataset dataset = load_dataset("mstz/iris", "iris")["train"]

“top 10 rows of Iris with summary statistics”
“counts grouped by the most common field in Iris”

Lrs The Lrs dataset from the UCI repository. Configurations and tasks Configuration Task Description lrs Multiclass classification Classify lrs type. lrs_0 Binary classification Is this instance of class 0? lrs_1 Binary classification Is this instance of class 1? lrs_2 Binary classification Is this instance of class 2? lrs_3 Binary classification Is this instance of class 3? lrs_4 Binary classification Is this instance of class 4? lrs_5… See the full description on the dataset page: https://huggingface.co/datasets/mstz/lrs.

“scatter right_ascension vs declination”
“histogram of right_ascension”

Splice The Splice dataset from the UCI repository. Configurations and tasks Configuration Task splice Multiclass classification splice_EI Binary classification splice_IE Binary classification splice_N Binary classification

“top 10 rows of Splice with summary statistics”
“counts grouped by the most common field in Splice”

PageBlocks The PageBlocks dataset from the UCI repository. How many transitions does the page block have? Configurations and tasks Configuration Task page_blocks Multiclass classification page_blocks_binary Binary classification

“scatter height vs lenght”
“histogram of height”

Post Operative The PostOperative dataset from the UCI repository. Should the patient be discharged from the hospital, go to the ground floor, or to the ICU? Configurations and tasks Configuration Task post_operative Multiclass classification. post_operative_binary Binary classification.

“average internal_temperature by is_surface_temperature_stable”
“top 10 is_surface_temperature_stable by total internal_temperature”

Post Operative The Seeds dataset from the UCI repository. Configurations and tasks Configuration Task Description seeds Multiclass classification. seeds_0 Binary classification. Is the seed of class 0? seeds_1 Binary classification. Is the seed of class 1? seeds_2 Binary classification. Is the seed of class 2?

“scatter area vs perimeter”
“histogram of area”

Hugging Face dataset: mstz/segment

“scatter region_centroid_col vs region_centroid_row”
“histogram of value_mean”

Landsat The Landsat dataset from the UCI repository. Configurations and tasks Configuration Task Description landsat Multiclass classification. landsat_0 Binary classification. Is the image of class 0? landsat_1 Binary classification. Is the image of class 1? landsat_2 Binary classification. Is the image of class 2? landsat_3 Binary classification. Is the image of class 3? landsat_4 Binary classification. Is the image of class 4? landsat_5… See the full description on the dataset page: https://huggingface.co/datasets/mstz/landsat.

“top 10 rows of Landsat with summary statistics”
“counts grouped by the most common field in Landsat”

Shuttle The Shuttle dataset from the UCI repository. Configurations and tasks Configuration Task Description shuttle Multiclass classification. shuttle_0 Binary classification. Is the image of class 0? shuttle_1 Binary classification. Is the image of class 1? shuttle_2 Binary classification. Is the image of class 2? shuttle_3 Binary classification. Is the image of class 3? shuttle_4 Binary classification. Is the image of class 4? shuttle_5… See the full description on the dataset page: https://huggingface.co/datasets/mstz/shuttle.

“line chart of rad_flow over time”
“scatter rad_flow vs fpv_close”

Landsat The Steel Plates dataset from the UCI repository. Configurations and tasks Configuration Task Description steel_plates Multiclass classification. steel_plates_0 Binary classification. Is the input of class 0? steel_plates_1 Binary classification. Is the input of class 1? steel_plates_2 Binary classification. Is the input of class 2? steel_plates_3 Binary classification. Is the input of class 3? steel_plates_4 Binary classification. Is… See the full description on the dataset page: https://huggingface.co/datasets/mstz/steel_plates.

“scatter x_minimum vs x_maximum”
“histogram of x_minimum”

WallFollowing The WallFollowing dataset from the UCI repository. Configurations and tasks Configuration Task Description wall_following Multiclass classification. wall_following_0 Binary classification. Is the instance of class 0? wall_following_1 Binary classification. Is the instance of class 1? wall_following_2 Binary classification. Is the instance of class 2? wall_following_3 Binary classification. Is the instance of class 3?

“scatter US1 vs US2”
“histogram of US1”

WaveformNoiseV1 The WaveformNoiseV1 dataset from the UCI repository. Configurations and tasks Configuration Task Description waveformnoiseV1 Multiclass classification. waveformnoiseV1_0 Binary classification. Is the image of class 0? waveformnoiseV1_1 Binary classification. Is the image of class 1? waveformnoiseV1_2 Binary classification. Is the image of class 2?

“top 10 rows of Waveform Noise V1 with summary statistics”
“counts grouped by the most common field in Waveform Noise V1”

Wine Origin The Wine Origin dataset from the UCI repository. Configurations and tasks Configuration Task Description wine_origin Multiclass classification. wine_origin_0 Binary classification. Is the instance of class 0? wine_origin_1 Binary classification. Is the instance of class 1? wine_origin_2 Binary classification. Is the instance of class 2?

“scatter malic_acid vs ash”
“histogram of malic_acid”

Arcene The Arcene dataset from the UCI repository.

“top 10 rows of Arcene with summary statistics”
“counts grouped by the most common field in Arcene”

Ipums The Ipums dataset from the UCI repository.

“top 10 rows of Ipums with summary statistics”
“counts grouped by the most common field in Ipums”

Dexter The Dexter dataset from the UCI repository. Configurations and tasks Configuration Task dexter Binary classification.

“top 10 rows of Dexter with summary statistics”
“counts grouped by the most common field in Dexter”

Gisette The Gisette dataset from the UCI repository. Configurations and tasks Configuration Task Description gisette Binary classification.

“top 10 rows of Gisette with summary statistics”
“counts grouped by the most common field in Gisette”

Golf The Golf dataset. Is it a good day to play golf? Configurations and tasks Configuration Task golf Binary classification.

“top 10 rows of Golf with summary statistics”
“counts grouped by the most common field in Golf”

Hypo The Hypo dataset. Configurations and tasks Configuration Task Description hypo Multiclass classification. What kind of hypothyroidism does the patient have? has_hypo Binary classification. Does the patient hypothyroidism does the patient have?

“average age by sex”
“top 10 sex by total age”

Optdigits The Optdigits dataset from the UCI repository. Configurations and tasks Configuration Task Description optdigits Multiclass classification. 0 Binary classification. Is this a 0? 1 Binary classification. Is this a 1? 2 Binary classification. Is this a 2? ... Binary classification. ...

“top 10 rows of Optdigits with summary statistics”
“counts grouped by the most common field in Optdigits”

P53 The P53 dataset from the UCI repository. Configurations and tasks Configuration Task Description p53 Binary classification.

“top 10 rows of P53 with summary statistics”
“counts grouped by the most common field in P53”

Pums The Pums dataset from the UCI repository. U.S.A. Census dataset, classify the income of the individual. Configurations and tasks Configuration Task pums Binary classification.

“top 10 rows of Pums with summary statistics”
“counts grouped by the most common field in Pums”

Soybean The Soybean dataset from the UCI repository. Classify the type of soybean. Configurations and tasks Configuration Task Description soybean Binary classification. Classify soybean type. diaporthe_stem_canker Binary classification Is this instance of class diaporthe_stem_canker? charcoal_rot Binary classification Is this instance of class charcoal_rot? rhizoctonia_root_rot Binary classification Is this instance of class rhizoctonia_root_rot?… See the full description on the dataset page: https://huggingface.co/datasets/mstz/soybean.

“top 10 rows of Soybean with summary statistics”
“counts grouped by the most common field in Soybean”

Victorian authorship The Victorian authorship dataset. Which Victorian author wrote the given text? Configurations and tasks Configuration Task Description authorship Classification Which Victorian author wrote the given text? Usage from datasets import load_dataset dataset = load_dataset("mstz/victorian_authorship", "authorship")["train"] Features Feature Type text [string] Citation… See the full description on the dataset page: https://huggingface.co/datasets/mstz/victorian_authorship.

“top 10 rows of Victorian Authorship with summary statistics”
“counts grouped by the most common field in Victorian Authorship”

Abalone The Medieval Latin authorship attribution dataset. Which Victorian author wrote the given text? Note: Only epistolas are included in this version of the dataset. Configurations and tasks Configuration Task Description authorship Classification Which Latin author wrote the given text? Usage from datasets import load_dataset dataset = load_dataset("mstz/medieval_latin", "authorship")["train"] Features… See the full description on the dataset page: https://huggingface.co/datasets/mstz/medieval_latin.

“top 10 rows of Medieval Latin with summary statistics”
“counts grouped by the most common field in Medieval Latin”

Hugging Face dataset: mstz/fairbelief

“top 10 rows of Fairbelief with summary statistics”
“counts grouped by the most common field in Fairbelief”

Hugging Face dataset: mstz/cifar10OD

“top 10 rows of Cifar10Od with summary statistics”
“counts grouped by the most common field in Cifar10Od”

This is the sentiment analysis dataset based on IMDB reviews initially released by Stanford University. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided. See the README file contained in the release for more… See the full description on the dataset page: https://huggingface.co/datasets/scikit-learn/imdb.

“count of records over sentiment”
“most common values in review”

Student Alcohol Consumption Dataset A dataset on social, gender and study data from secondary school students. Following was retrieved from UCI machine learning repository. Context: The data were obtained in a survey of students math and portuguese language courses in secondary school. It contains a lot of interesting social, gender and study information about students. You can use it for some EDA or try to predict students final grade. Content: Attributes for both… See the full description on the dataset page: https://huggingface.co/datasets/scikit-learn/student-alcohol-consumption.

“line chart of age over traveltime”
“average age by school”

A Waiter's Tips The following description was retrieved from Kaggle page. Food servers’ tips in restaurants may be influenced by many factors, including the nature of the restaurant, size of the party, and table locations in the restaurant. Restaurant managers need to know which factors matter when they assign tables to food servers. For the sake of staff morale, they usually want to avoid either the substance or the appearance of unfair treatment of the servers, for whom tips… See the full description on the dataset page: https://huggingface.co/datasets/scikit-learn/tips.

“line chart of total_bill over day”
“average total_bill by sex”

Customer churn prediction dataset of a fictional telecommunication company made by IBM Sample Datasets. Context Predict behavior to retain customers. You can analyze all relevant customer data and develop focused customer retention programs. Content Each row represents a customer, each column contains customer’s attributes described on the column metadata. The data set includes information about: Customers who left within the last month: the column is called Churn Services that each customer… See the full description on the dataset page: https://huggingface.co/datasets/scikit-learn/churn-prediction.

“line chart of SeniorCitizen over MonthlyCharges”
“average SeniorCitizen by gender”

Dataset Summary Dataset recording various measurements of 7 different species of fish at a fish market. Predictive models can be used to predict weight, species, etc. Feature Descriptions Species - Species name of fish Weight - Weight of fish in grams Length1 - Vertical length in cm Length2 - Diagonal length in cm Length3 - Cross length in cm Height - Height in cm Width - Width in cm Acknowledgments Dataset created by Aung Pyae, and found on… See the full description on the dataset page: https://huggingface.co/datasets/scikit-learn/Fish.

“average Weight by Species”
“top 10 Species by total Weight”

Dataset Card for "modelnet40" More Information needed

“top 10 rows of Modelnet40 with summary statistics”
“counts grouped by the most common field in Modelnet40”

Dataset Card for "esol" More Information needed

“top 10 rows of Esol with summary statistics”
“counts grouped by the most common field in Esol”

Dataset Card for "freesolv" More Information needed

“top 10 rows of Freesolv with summary statistics”
“counts grouped by the most common field in Freesolv”

Dataset Card for "lipop" More Information needed

“top 10 rows of Lipop with summary statistics”
“counts grouped by the most common field in Lipop”

Dataset Card for "bace" More Information needed

“top 10 rows of Bace with summary statistics”
“counts grouped by the most common field in Bace”

Dataset Card for "bbbp" More Information needed

“top 10 rows of Bbbp with summary statistics”
“counts grouped by the most common field in Bbbp”

Dataset Card for "hiv" More Information needed

“top 10 rows of Hiv with summary statistics”
“counts grouped by the most common field in Hiv”

Dataset Card for "ecg" More Information needed

“histogram of label”

Dataset Card for "emg" More Information needed

“histogram of label”

Dataset Card for "shapenet55" More Information needed

“top 10 rows of Shapenet55 with summary statistics”
“counts grouped by the most common field in Shapenet55”

Dataset Card for "modelnet40-2048" More Information needed

“histogram of label”

Dataset Card for "scanobjectnn" More Information needed

“top 10 rows of Scanobjectnn with summary statistics”
“counts grouped by the most common field in Scanobjectnn”

Dataset Card for "epsilon" More Information needed

“histogram of label”

Dataset Card for "epsilon-normalized" More Information needed

“histogram of label”

Dataset Card for "covtype" More Information needed

“histogram of label”

Dataset Card for "ca_housing" More Information needed

“histogram of label”

Dataset Card for "year" More Information needed

“histogram of label”

Dataset Card for "aloi" More Information needed

“histogram of label”

Dataset Card for "trivia_qa" More Information needed

“most common values in question”
“length distribution of question”

Dataset Card for "natural_questions" More Information needed

“most common values in question”
“length distribution of question”

Dataset Card for "truthful_qa" More Information needed

“top 10 rows of Truthful Qa with summary statistics”
“counts grouped by the most common field in Truthful Qa”

Hugging Face dataset: jxie/celeba-hq-vit-s-vqgan-virtual-share-causal-128-embed_dim32-ce0-cew1-continuous

“top 10 rows of Celeba Hq Vit S Vqgan Virtual Share Causal 128 Embed Dim32 Ce0 Cew1 Continuous with summary statistics”
“counts grouped by the most common field in Celeba Hq Vit S Vqgan Virtual Share Causal 128 Embed Dim32 Ce0 Cew1 Continuous”

Hugging Face dataset: jxie/models

“top 10 rows of Models with summary statistics”
“counts grouped by the most common field in Models”

Hugging Face dataset: jxie/UCF-101-pose-captions

“count of records by class”
“distribution of class”

Hugging Face dataset: jxie/video-testing

“top 10 rows of Video Testing with summary statistics”
“counts grouped by the most common field in Video Testing”

Hugging Face dataset: jxie/UCF-101

“top 10 rows of Ucf 101 with summary statistics”
“counts grouped by the most common field in Ucf 101”

Hugging Face dataset: jxie/video-testing_bootstapir_checkpoint_v2

“count of records by video”
“distribution of video”

Hugging Face dataset: jxie/UCF-101-bootstapir_checkpoint_v2

“top 10 rows of Ucf 101 Bootstapir Checkpoint V2 with summary statistics”
“counts grouped by the most common field in Ucf 101 Bootstapir Checkpoint V2”

Hugging Face dataset: jxie/UCF_101-bootstapir_checkpoint_v2-with_visibles

“top 10 rows of Ucf 101 Bootstapir Checkpoint V2 With Visibles with summary statistics”
“counts grouped by the most common field in Ucf 101 Bootstapir Checkpoint V2 With Visibles”

Hugging Face dataset: jxie/UCF_101-bootstapir_checkpoint_v2-with_visibles-tiny

“top 10 rows of Ucf 101 Bootstapir Checkpoint V2 With Visibles Tiny with summary statistics”
“counts grouped by the most common field in Ucf 101 Bootstapir Checkpoint V2 With Visibles Tiny”

Hugging Face dataset: jxie/DTDB

“top 10 rows of Dtdb with summary statistics”
“counts grouped by the most common field in Dtdb”

Hugging Face dataset: jxie/DTDB-bootstapir_checkpoint_v2-with_visibles

“top 10 rows of Dtdb Bootstapir Checkpoint V2 With Visibles with summary statistics”
“counts grouped by the most common field in Dtdb Bootstapir Checkpoint V2 With Visibles”

Hugging Face dataset: jxie/motion_x_video

“top 10 rows of Motion X Video with summary statistics”
“counts grouped by the most common field in Motion X Video”

Hugging Face dataset: jxie/UCF-101-bootstapir_checkpoint_v2-still_edge_crop-discard_camera_motion

“top 10 rows of Ucf 101 Bootstapir Checkpoint V2 Still Edge Crop Discard Camera Motion with summary statistics”
“counts grouped by the most common field in Ucf 101 Bootstapir Checkpoint V2 Still Edge Crop Discard Camera Motion”

Hugging Face dataset: jxie/cleverer

“top 10 rows of Cleverer with summary statistics”
“counts grouped by the most common field in Cleverer”

Hugging Face dataset: jxie/cleverer-bootstapir_checkpoint_v2

“top 10 rows of Cleverer Bootstapir Checkpoint V2 with summary statistics”
“counts grouped by the most common field in Cleverer Bootstapir Checkpoint V2”

Hugging Face dataset: jxie/cleverer-bootstapir_checkpoint_v2-crop_still_edges-discard_camera-motion

“top 10 rows of Cleverer Bootstapir Checkpoint V2 Crop Still Edges Discard Camera Motion with summary statistics”
“counts grouped by the most common field in Cleverer Bootstapir Checkpoint V2 Crop Still Edges Discard Camera Motion”

Hugging Face dataset: jxie/motion_x-bootstapir_checkpoint_v2-crop_still_edges-discard_camera-motion

“top 10 rows of Motion X Bootstapir Checkpoint V2 Crop Still Edges Discard Camera Motion with summary statistics”
“counts grouped by the most common field in Motion X Bootstapir Checkpoint V2 Crop Still Edges Discard Camera Motion”

Hugging Face dataset: jxie/UCF_101-evaluation-testing

“top 10 rows of Ucf 101 Evaluation Testing with summary statistics”
“counts grouped by the most common field in Ucf 101 Evaluation Testing”

Hugging Face dataset: jxie/tapvid-davis

“top 10 rows of Tapvid Davis with summary statistics”
“counts grouped by the most common field in Tapvid Davis”

Hugging Face dataset: jxie/tapvid-davis-no_camera_motion

“top 10 rows of Tapvid Davis No Camera Motion with summary statistics”
“counts grouped by the most common field in Tapvid Davis No Camera Motion”

Hugging Face dataset: jxie/UCF_101-bootstapir_checkpoint_v2-crop_still_edges

“top 10 rows of Ucf 101 Bootstapir Checkpoint V2 Crop Still Edges with summary statistics”
“counts grouped by the most common field in Ucf 101 Bootstapir Checkpoint V2 Crop Still Edges”

Hugging Face dataset: jxie/droid

“top 10 rows of Droid with summary statistics”
“counts grouped by the most common field in Droid”

Hugging Face dataset: jxie/UCF_101-100_samples-100_points-mwt0.0

“top 10 rows of Ucf 101 100 Samples 100 Points Mwt0.0 with summary statistics”
“counts grouped by the most common field in Ucf 101 100 Samples 100 Points Mwt0.0”

Hugging Face dataset: jxie/UCF_101-100_samples-100_points-mwt0.5

“top 10 rows of Ucf 101 100 Samples 100 Points Mwt0.5 with summary statistics”
“counts grouped by the most common field in Ucf 101 100 Samples 100 Points Mwt0.5”

Hugging Face dataset: jxie/language_table

“top 10 rows of Language Table with summary statistics”
“counts grouped by the most common field in Language Table”

Hugging Face dataset: jxie/language_table-bootstapir_checkpoint_v2

“average movements by camera_motion”
“top 10 camera_motion by total movements”

Hugging Face dataset: jxie/motion_x_video-bootstapir_checkpoint_v2-crop_still_edges

“top 10 rows of Motion X Video Bootstapir Checkpoint V2 Crop Still Edges with summary statistics”
“counts grouped by the most common field in Motion X Video Bootstapir Checkpoint V2 Crop Still Edges”

Hugging Face dataset: jxie/UCF_101-100_samples-100_points-50_movement_min-mwt0.5

“top 10 rows of Ucf 101 100 Samples 100 Points 50 Movement Min Mwt0.5 with summary statistics”
“counts grouped by the most common field in Ucf 101 100 Samples 100 Points 50 Movement Min Mwt0.5”

Hugging Face dataset: jxie/bridge_data_v2

“top 10 rows of Bridge Data V2 with summary statistics”
“counts grouped by the most common field in Bridge Data V2”

Hugging Face dataset: jxie/bridge_data_v2-bootstapir_checkpoint_v2

“top 10 rows of Bridge Data V2 Bootstapir Checkpoint V2 with summary statistics”
“counts grouped by the most common field in Bridge Data V2 Bootstapir Checkpoint V2”

Hugging Face dataset: jxie/UCF_101_testing

“average movements by camera_motion”
“top 10 camera_motion by total movements”

Hugging Face dataset: jxie/motion_x_video-testing

“top 10 rows of Motion X Video Testing with summary statistics”
“counts grouped by the most common field in Motion X Video Testing”

Hugging Face dataset: jxie/kinetics400-bootstapir_checkpoint_v2-crop_still_edges

“top 10 rows of Kinetics400 Bootstapir Checkpoint V2 Crop Still Edges with summary statistics”
“counts grouped by the most common field in Kinetics400 Bootstapir Checkpoint V2 Crop Still Edges”

Hugging Face dataset: jxie/droid_exterior-bootstapir_checkpoint_v2

“top 10 rows of Droid Exterior Bootstapir Checkpoint V2 with summary statistics”
“counts grouped by the most common field in Droid Exterior Bootstapir Checkpoint V2”

Hugging Face dataset: jxie/mimicgen

“top 10 rows of Mimicgen with summary statistics”
“counts grouped by the most common field in Mimicgen”

Hugging Face dataset: jxie/UCF_101-100_samples-100_points-50_movement_min-mwt0.5-with_movements

“top 10 rows of Ucf 101 100 Samples 100 Points 50 Movement Min Mwt0.5 With Movements with summary statistics”
“counts grouped by the most common field in Ucf 101 100 Samples 100 Points 50 Movement Min Mwt0.5 With Movements”

Hugging Face dataset: jxie/hmdb51

“top 10 rows of Hmdb51 with summary statistics”
“counts grouped by the most common field in Hmdb51”

Hugging Face dataset: jxie/hmdb51-bootstapir_checkpoint_v2-crop_still_edges

“top 10 rows of Hmdb51 Bootstapir Checkpoint V2 Crop Still Edges with summary statistics”
“counts grouped by the most common field in Hmdb51 Bootstapir Checkpoint V2 Crop Still Edges”

Hugging Face dataset: jxie/hmdb51-100_samples-100_points-50_movement_min-mwt0.5

“top 10 rows of Hmdb51 100 Samples 100 Points 50 Movement Min Mwt0.5 with summary statistics”
“counts grouped by the most common field in Hmdb51 100 Samples 100 Points 50 Movement Min Mwt0.5”

Hugging Face dataset: jxie/maniskill

“top 10 rows of Maniskill with summary statistics”
“counts grouped by the most common field in Maniskill”

Hugging Face dataset: jxie/language_table-first_100_samples-100_points-mwt0.5

“top 10 rows of Language Table First 100 Samples 100 Points Mwt0.5 with summary statistics”
“counts grouped by the most common field in Language Table First 100 Samples 100 Points Mwt0.5”

Hugging Face dataset: jxie/calvin_1_chained

“top 10 rows of Calvin 1 Chained with summary statistics”
“counts grouped by the most common field in Calvin 1 Chained”

Hugging Face dataset: jxie/calvin_1_chained-bootstapir_checkpoint_v2

“top 10 rows of Calvin 1 Chained Bootstapir Checkpoint V2 with summary statistics”
“counts grouped by the most common field in Calvin 1 Chained Bootstapir Checkpoint V2”

Hugging Face dataset: jxie/calvin-100_samples-100_points-50_movement_min-mwt0.5

“top 10 rows of Calvin 100 Samples 100 Points 50 Movement Min Mwt0.5 with summary statistics”
“counts grouped by the most common field in Calvin 100 Samples 100 Points 50 Movement Min Mwt0.5”

Hugging Face dataset: jxie/mimicgen-bootstapir_checkpoint_v2

“top 10 rows of Mimicgen Bootstapir Checkpoint V2 with summary statistics”
“counts grouped by the most common field in Mimicgen Bootstapir Checkpoint V2”

Hugging Face dataset: jxie/maniskill-bootstapir_checkpoint_v2

“top 10 rows of Maniskill Bootstapir Checkpoint V2 with summary statistics”
“counts grouped by the most common field in Maniskill Bootstapir Checkpoint V2”

Hugging Face dataset: jxie/something_something_v2

“top 10 rows of Something Something V2 with summary statistics”
“counts grouped by the most common field in Something Something V2”

Hugging Face dataset: jxie/calvin_1_chained_testing

“average movements by camera_motion”
“top 10 camera_motion by total movements”

Hugging Face dataset: jxie/webvid_10m

“top 10 rows of Webvid 10M with summary statistics”
“counts grouped by the most common field in Webvid 10M”

Hugging Face dataset: jxie/webvid_10m-bootstapir_checkpoint_v2

“top 10 rows of Webvid 10M Bootstapir Checkpoint V2 with summary statistics”
“counts grouped by the most common field in Webvid 10M Bootstapir Checkpoint V2”

Hugging Face dataset: jxie/UCF_101-bootstapir_checkpoint_v2-crop_still_edges-video_llama3_7b_captioned

“top 10 rows of Ucf 101 Bootstapir Checkpoint V2 Crop Still Edges Video Llama3 7B Captioned with summary statistics”
“counts grouped by the most common field in Ucf 101 Bootstapir Checkpoint V2 Crop Still Edges Video Llama3 7B Captioned”

Hugging Face dataset: jxie/hmdb51-bootstapir_checkpoint_v2

“top 10 rows of Hmdb51 Bootstapir Checkpoint V2 with summary statistics”
“counts grouped by the most common field in Hmdb51 Bootstapir Checkpoint V2”

Hugging Face dataset: jxie/kinetics400-100_points-50_movement_min-mwt0.5

“top 10 rows of Kinetics400 100 Points 50 Movement Min Mwt0.5 with summary statistics”
“counts grouped by the most common field in Kinetics400 100 Points 50 Movement Min Mwt0.5”

Hugging Face dataset: jxie/calvin_abc_d-100_points-50_movement_min-mwt0.5

“top 10 rows of Calvin Abc D 100 Points 50 Movement Min Mwt0.5 with summary statistics”
“counts grouped by the most common field in Calvin Abc D 100 Points 50 Movement Min Mwt0.5”

Hugging Face dataset: jxie/UCF_101-100_points-mwt0.5

“top 10 rows of Ucf 101 100 Points Mwt0.5 with summary statistics”
“counts grouped by the most common field in Ucf 101 100 Points Mwt0.5”

Hugging Face dataset: jxie/language_table_sim

“top 10 rows of Language Table Sim with summary statistics”
“counts grouped by the most common field in Language Table Sim”

Hugging Face dataset: jxie/language_table_sim-bootstapir_checkpoint_v2

“average movements by camera_motion”
“top 10 camera_motion by total movements”

Hugging Face dataset: jxie/kuka

“top 10 rows of Kuka with summary statistics”
“counts grouped by the most common field in Kuka”

Hugging Face dataset: jxie/bc_z

“top 10 rows of Bc Z with summary statistics”
“counts grouped by the most common field in Bc Z”

Hugging Face dataset: jxie/fractal

“top 10 rows of Fractal with summary statistics”
“counts grouped by the most common field in Fractal”

Hugging Face dataset: jxie/bc_z-bootstapir_checkpoint_v2

“top 10 rows of Bc Z Bootstapir Checkpoint V2 with summary statistics”
“counts grouped by the most common field in Bc Z Bootstapir Checkpoint V2”

Hugging Face dataset: jxie/webvid_10m-100_samples-100_points-50_movement_min-mwt0.5

“top 10 rows of Webvid 10M 100 Samples 100 Points 50 Movement Min Mwt0.5 with summary statistics”
“counts grouped by the most common field in Webvid 10M 100 Samples 100 Points 50 Movement Min Mwt0.5”

Hugging Face dataset: jxie/webvid_10m_part_10

“top 10 rows of Webvid 10M Part 10 with summary statistics”
“counts grouped by the most common field in Webvid 10M Part 10”

Hugging Face dataset: jxie/webvid_10m_part_4

“top 10 rows of Webvid 10M Part 4 with summary statistics”
“counts grouped by the most common field in Webvid 10M Part 4”

Hugging Face dataset: jxie/webvid_10m_part_7

“top 10 rows of Webvid 10M Part 7 with summary statistics”
“counts grouped by the most common field in Webvid 10M Part 7”

Hugging Face dataset: jxie/webvid_10m_part_6

“top 10 rows of Webvid 10M Part 6 with summary statistics”
“counts grouped by the most common field in Webvid 10M Part 6”

Hugging Face dataset: jxie/webvid_10m_part_8

“top 10 rows of Webvid 10M Part 8 with summary statistics”
“counts grouped by the most common field in Webvid 10M Part 8”

Hugging Face dataset: jxie/webvid_10m_part_9

“top 10 rows of Webvid 10M Part 9 with summary statistics”
“counts grouped by the most common field in Webvid 10M Part 9”

Hugging Face dataset: jxie/webvid_10m_part_5

“top 10 rows of Webvid 10M Part 5 with summary statistics”
“counts grouped by the most common field in Webvid 10M Part 5”

Hugging Face dataset: jxie/webvid_10m_part_1

“top 10 rows of Webvid 10M Part 1 with summary statistics”
“counts grouped by the most common field in Webvid 10M Part 1”

Hugging Face dataset: jxie/webvid_10m_part_0

“top 10 rows of Webvid 10M Part 0 with summary statistics”
“counts grouped by the most common field in Webvid 10M Part 0”

Hugging Face dataset: jxie/webvid_10m_part_3

“top 10 rows of Webvid 10M Part 3 with summary statistics”
“counts grouped by the most common field in Webvid 10M Part 3”

Hugging Face dataset: jxie/webvid_10m_part_2

“top 10 rows of Webvid 10M Part 2 with summary statistics”
“counts grouped by the most common field in Webvid 10M Part 2”

Hugging Face dataset: jxie/fractal-bootstapir_checkpoint_v2

“top 10 rows of Fractal Bootstapir Checkpoint V2 with summary statistics”
“counts grouped by the most common field in Fractal Bootstapir Checkpoint V2”

Hugging Face dataset: jxie/webvid_10m_part_0-bootstapir_checkpoint_v2

“top 10 rows of Webvid 10M Part 0 Bootstapir Checkpoint V2 with summary statistics”
“counts grouped by the most common field in Webvid 10M Part 0 Bootstapir Checkpoint V2”

Hugging Face dataset: jxie/webvid_10m_part_1-bootstapir_checkpoint_v2

“top 10 rows of Webvid 10M Part 1 Bootstapir Checkpoint V2 with summary statistics”
“counts grouped by the most common field in Webvid 10M Part 1 Bootstapir Checkpoint V2”

Hugging Face dataset: jxie/webvid_10m_part_3-bootstapir_checkpoint_v2

“top 10 rows of Webvid 10M Part 3 Bootstapir Checkpoint V2 with summary statistics”
“counts grouped by the most common field in Webvid 10M Part 3 Bootstapir Checkpoint V2”

Hugging Face dataset: jxie/webvid_10m_part_2-bootstapir_checkpoint_v2

“top 10 rows of Webvid 10M Part 2 Bootstapir Checkpoint V2 with summary statistics”
“counts grouped by the most common field in Webvid 10M Part 2 Bootstapir Checkpoint V2”

Hugging Face dataset: jxie/None

“top 10 rows of None with summary statistics”
“counts grouped by the most common field in None”

Hugging Face dataset: jxie/behave_dataset

“top 10 rows of Behave Dataset with summary statistics”
“counts grouped by the most common field in Behave Dataset”

Hugging Face dataset: jxie/omomo_video

“top 10 rows of Omomo Video with summary statistics”
“counts grouped by the most common field in Omomo Video”

Hugging Face dataset: jxie/omomo_video-bootstapir_checkpoint_v2

“top 10 rows of Omomo Video Bootstapir Checkpoint V2 with summary statistics”
“counts grouped by the most common field in Omomo Video Bootstapir Checkpoint V2”

Hugging Face dataset: jxie/omomo_video-100_samples-100_points-50_movement_min-mwt0.5

“top 10 rows of Omomo Video 100 Samples 100 Points 50 Movement Min Mwt0.5 with summary statistics”
“counts grouped by the most common field in Omomo Video 100 Samples 100 Points 50 Movement Min Mwt0.5”

Hugging Face dataset: jxie/omomo_video-crop_still_edges-bootstapir_checkpoint_v2

“top 10 rows of Omomo Video Crop Still Edges Bootstapir Checkpoint V2 with summary statistics”
“counts grouped by the most common field in Omomo Video Crop Still Edges Bootstapir Checkpoint V2”

Hugging Face dataset: jxie/omomo_video-crop_still_edges-100_samples-100_points-50_movement_min-mwt0.5

“top 10 rows of Omomo Video Crop Still Edges 100 Samples 100 Points 50 Movement Min Mwt0.5 with summary statistics”
“counts grouped by the most common field in Omomo Video Crop Still Edges 100 Samples 100 Points 50 Movement Min Mwt0.5”

Hugging Face dataset: jxie/motion_x_rendered

“top 10 rows of Motion X Rendered with summary statistics”
“counts grouped by the most common field in Motion X Rendered”

Hugging Face dataset: jxie/motion_x_rendered-bootstapir_checkpoint_v2

“top 10 rows of Motion X Rendered Bootstapir Checkpoint V2 with summary statistics”
“counts grouped by the most common field in Motion X Rendered Bootstapir Checkpoint V2”

Hugging Face dataset: jxie/anti-lgbt-cyberbullying

“histogram of label”
“most common values in text”

Hugging Face dataset: jxie/calvin-lmdb-abc_d

“top 10 rows of Calvin Lmdb Abc D with summary statistics”
“counts grouped by the most common field in Calvin Lmdb Abc D”

Hugging Face dataset: jxie/calvin-lmdb-abcd_d-10_percent

“top 10 rows of Calvin Lmdb Abcd D 10 Percent with summary statistics”
“counts grouped by the most common field in Calvin Lmdb Abcd D 10 Percent”

SEPAL: Scalable Feature Learning on Huge Knowledge Graphs for Downstream Machine Learning This dataset contains the Mini YAGO3 knowledge graph and the downstream tasks used in the SEPAL paper (https://arxiv.org/pdf/2507.00965). SEPAL is a knowledge graph embedding method for very large knowledge graphs. It is designed to produce good embeddings for downstream regression and classification tasks. The code for SEPAL can be found at https://github.com/soda-inria/sepal… See the full description on the dataset page: https://huggingface.co/datasets/inria-soda/sepal-datasets.

“histogram of target”
“most common values in raw_entity”

Carshare

Election

Tips

Gapminder

Iris

Medals_long

Stocks

Experiments

wind

Hormuz Shipping Crisis

Energy Transition Mix

Ecommerce Funnel

City Air Quality

Iris

Adult Census Income

Breast Cancer Wisconsin

Heart

Adult

Wine

Titanic

Abalone

Car

Mushroom

Glass

Demo1

Custom Squad

Reddit Finance 43 250K

Finance Tasks

Financial Classification

Ml Arxiv Papers

Ai Arxiv Chunked

Github Code Clean

Github Jupyter Code To Text

The Stack Smol

Apps

World Cities Geo

Channel Metadata

Youtube Transcriptions

Codealpaca 20K

Medical Meadow Medqa

Wiki Medical Terms

Medical Qa Datasets

Healthcare Data

Chest Xray Classification

Climate Sentiment

Climate Detection

Environmental Claims

Imdb

Ag News

Amazon Polarity

Banking77

Glue

Glue

Glue

Glue

Super Glue

Squad

Squad V2

Wmt16

Conll2003

Rotten Tomatoes

Yelp Polarity

Trec

Dbpedia 14

Emotion

Tweet Eval

Tweet Eval

Tweet Eval

Go Emotions

Snli

Multi Nli

Hellaswag

Piqa

Winogrande

Openai Humaneval

Mbpp

Lambada

Mnist

Cifar10

Cifar100