Awesome DataSets Collections

Collection

Search Index

  • Google Dataset Search: A search service for finding datasets published by scientific, government, and some news organizations.

  • Re3Data: 2,000 Data Repositories and Science Europe’s Framework for Discipline-specific Research Data Management

  • Open Data Inception: 2600+ Open Data Portals Around the World

  • Tianchi Datasets (天池数据集): A multi-domain collection of datasets for scientific research and experiments.

Repositories

NLP & Text DataSets

  • DataSets: Datasets and evaluation metrics for natural language processing and more. Compatible with NumPy, Pandas, PyTorch and TensorFlow.

  • 20 Newsgroups: The text of 20,000 messages taken from 20 Usenet newsgroups, for text analysis, classification, etc.

  • Wikimedia Dumps: The Wikimedia Foundation is requesting help to ensure that as many copies as possible are available of all Wikimedia database dumps.

  • Amazon Reviews: Over 142 million product reviews for sentiment analysis, recommender systems, and more.

Chinese Text

  • 2016-THUCTC: Tsinghua University news dataset.

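  • chinese-xinhua: Xinhua Dictionary database and API. Includes 14,032 xiehouyu (two-part allegorical sayings), 16,142 Chinese characters, 264,434 words, and 31,648 idioms (chengyu).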

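  • chinese-poetry: The most comprehensive database of classical Chinese literature, containing 55,000 Tang poems, 260,000 Song poems, and 21,000 Song ci. It covers nearly 14,000 poets of the Tang and Song dynasties and 1,500 ci writers of the Song period. The data was collected from the internet.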

  • 2019-ChineseGLUE: Language Understanding Evaluation benchmark for Chinese: datasets, baselines, pre-trained models, corpus and leaderboard.

Image DataSets

  • fashion-mnist: Fashion-MNIST is a dataset of Zalando’s article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples.

  • facets: The facets project contains two visualizations for understanding and analyzing machine learning datasets: Facets Overview and Facets Dive.

  • Labeled Faces in the Wild: 13,000 named faces for facial recognition. Multiple training and test sets. 173MB total.

  • Mushroom Identification: For hypothetically classifying mushrooms as edible or poisonous based on their characteristics. 3 files, 480KB.

  • NORB 3D Object Recognition: Binocular images of 50 toy figurines for 3D object recognition from images. Multiple files, over 5GB total.

  • One Million Songs: Audio features and metadata for a subset (10,000) of the one million popular songs dataset, for recognition/classification. 1.8GB.

  • Hate Speech Identification: A sampling of Twitter posts that have been judged on whether they are offensive or contain hate speech, as a training set for text analysis. 2.66MB.

  • Hidden Beauty of Flickr Pictures: 15,000 Flickr photo IDs that have received ratings based on aesthetics, for image analysis. 138KB; use the Flickr API to get the images.

  • 2023-MultimodalC4: MultimodalC4 is a multimodal extension of C4 that interleaves millions of images with text.

  • NSFW Data Scrapper: Collection of scripts to aggregate image data for the purposes of training an NSFW image classifier.

  • im2latex-100k: A prebuilt dataset for OpenAI’s image-to-LaTeX task. Includes a total of ~100k formulas and images split into train, validation, and test sets.
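
Fashion-MNIST, listed above, is distributed in the same IDX binary format as the original MNIST. As a minimal sketch of how such files could be read without third-party libraries, the helpers below parse raw IDX bytes. The function names are hypothetical and not part of any library; the magic numbers (2051 for image files, 2049 for label files) follow the MNIST file format.

```python
import gzip
import struct


def parse_idx_images(raw: bytes):
    """Parse IDX image bytes into (images, rows, cols).

    Each image is returned as a bytes object of rows * cols grayscale pixels.
    """
    # Header: four big-endian unsigned 32-bit integers.
    magic, count, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != 2051:
        raise ValueError(f"unexpected image magic number: {magic}")
    size = rows * cols
    body = raw[16:]
    if len(body) != count * size:
        raise ValueError("pixel data length does not match header")
    images = [body[i * size:(i + 1) * size] for i in range(count)]
    return images, rows, cols


def parse_idx_labels(raw: bytes):
    """Parse IDX label bytes into a list of integer class ids."""
    # Header: two big-endian unsigned 32-bit integers.
    magic, count = struct.unpack(">II", raw[:8])
    if magic != 2049:
        raise ValueError(f"unexpected label magic number: {magic}")
    labels = list(raw[8:])
    if len(labels) != count:
        raise ValueError("label count does not match header")
    return labels


def read_gzip(path: str) -> bytes:
    """Read and decompress a .gz file (hypothetical convenience helper)."""
    with gzip.open(path, "rb") as handle:
        return handle.read()
```

Assuming the archives have been downloaded locally under the MNIST-style file names, `parse_idx_images(read_gzip("train-images-idx3-ubyte.gz"))` would be expected to return the 60,000 training images mentioned in the entry.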

Voice & Media & Video

Domain Data

Social Networks

  • Yahoo Instant Messenger Friends Connectivity Graph: Connections between Yahoo users who communicate with each other using Yahoo Messenger. Can be used to identify key social contacts and influencers. Add the dataset to your cart to access it.

  • SNAP: Stanford Large Network Dataset Collection

  • MLVIS: This project is the first to combine the notion of a data repository with real-time visual analytics for interactive data mining and exploratory analysis on the web.

  • Network Repository: Network Repository is not only the first interactive repository, but also the largest network repository, with thousands of donations.

Driving Data

LBS | Geolocation

Time Series

  • Time Series Data Library: The Time Series Data Library (TSDL) was created by Rob Hyndman, Professor of Statistics at Monash University, Australia.

Business DataSets

Financial & Securities

  • Tushare: Trading data covering stock market quotes. Simple interface calls return the corresponding data in DataFrame format.

Sports

  • Football Strategy: Thousands of scenarios to make the best coaching decisions. 876KB total.

  • Horses for Course: Horse-racing data for predicting race results. 19MB total.

  • NBA & MLB Stats: Current and past season stats for teams and players, for fantasy sports predictions.

Medicines

Foods

  • Wine Quality: Chemical properties of red and white wines (separately) and quality, for classification. 3 files, 343KB total.

  • malicious-urls: Hundreds of thousands of URLs, each labeled as malicious or not.

  • MovieLens: A large volume of movie review data.
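
For a tabular entry such as Wine Quality above, a typical first step is to read the file and derive a class label from the quality score. The sketch below assumes the semicolon-delimited layout with a `quality` column used by the UCI copy of the dataset; the copy linked here may differ. The inline rows are made-up placeholders rather than real measurements, and the helper names and the `threshold` cutoff are arbitrary illustrative choices.

```python
import csv
import io

# Placeholder rows for illustration only; these are NOT real measurements.
SAMPLE = '''"fixed acidity";"alcohol";"quality"
7.0;9.5;5
6.5;11.0;7
'''


def load_rows(text: str):
    """Parse semicolon-delimited text into a list of {column: float} dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    return [{name: float(value) for name, value in row.items()} for row in reader]


def label_good(rows, threshold=6.0):
    """Return True for each wine whose quality is at or above the cutoff."""
    return [row["quality"] >= threshold for row in rows]


rows = load_rows(SAMPLE)
labels = label_good(rows)
```

To work with a real file, the text passed to `load_rows` could instead be read from a local copy of the red or white wine CSV.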

Governments