Awesome-DataSets

Awesome DataSets Collections

Collection

Search Index

  • Google Dataset Search: A search service for finding data from the sciences, government, and some news organizations.

  • Re3Data: 2,000 Data Repositories and Science Europe’s Framework for Discipline-specific Research Data Management

  • Open Data Inception: 2600+ Open Data Portals Around the World

  • Tianchi Datasets (天池数据集): A multi-domain collection of datasets for scientific research and experiments.

Repositories

NLP & Text DataSets

  • DataSets : Datasets and evaluation metrics for natural language processing and more. Compatible with NumPy, Pandas, PyTorch and TensorFlow.

News

  • 20 Newsgroups: The text of 20,000 messages taken from 20 Usenet newsgroups, for text analysis, classification, etc.

Wiki

  • Wikimedia Dumps: The Wikimedia Foundation is requesting help to ensure that as many copies as possible are available of all Wikimedia database dumps.

Tweets

Comments/Reviews

  • Amazon Reviews: Over 142 million product reviews for sentiment analysis, recommender systems, and more.

Chinese Text

  • 2016-THUCTC: Tsinghua University news dataset.

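  • chinese-xinhua: Chinese Xinhua Dictionary database and API. Includes 14,032 xiehouyu (two-part allegorical sayings), 16,142 Chinese characters, 264,434 words, and 31,648 idioms (chengyu).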

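  • chinese-poetry: The most comprehensive database of classical Chinese literature, containing 55,000 Tang poems, 260,000 Song poems, and 21,000 Song ci, written by nearly 14,000 poets of the Tang and Song dynasties and 1,500 ci writers of the Song period. The data is sourced from the internet.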

  • 2019-ChineseGLUE : Language Understanding Evaluation benchmark for Chinese: datasets, baselines, pre-trained models, corpus, and leaderboard.

Image DataSets

  • fashion-mnist : Fashion-MNIST is a dataset of Zalando’s article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples.

  • facets : The facets project contains two visualizations for understanding and analyzing machine learning datasets: Facets Overview and Facets Dive.

  • Labeled Faces in the Wild: 13,000 named faces for facial recognition. Multiple training and test sets. 173MB

  • Mushroom Identification: For hypothetically classifying mushrooms as edible or poisonous based on their characteristics. 3 files, 480KB

  • NORB 3D Object Recognition: Binocular images of 50 toy figurines for 3D object recognition from images. Multiple files, over 5GB total

  • One Million Songs: Audio features and metadata for a subset (10,000) of the one million popular songs dataset, for recognition/classification. 1.8GB

  • Hate Speech Identification: A sample of Twitter posts that have been judged on whether they are offensive or contain hate speech, as a training set for text analysis. 2.66MB

  • Hidden Beauty of Flickr Pictures: 15,000 Flickr photo IDs that have received ratings based on aesthetics, for image analysis. 138KB; use the Flickr API to get the images.

  • 2023-MultimodalC4 : MultimodalC4 is a multimodal extension of c4 that interleaves millions of images with text.
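For the fashion-mnist entry above, a minimal sketch of reading the data without extra dependencies. It assumes the downloaded files are gzip-compressed IDX buffers, the layout used by the original MNIST; check the repository's README before relying on this. The helper names `parse_idx` and `load_idx_gz` are hypothetical and are not part of any library.

```python
import gzip
import struct


def parse_idx(raw: bytes):
    """Parse an IDX byte buffer into (shape, flat list of unsigned bytes)."""
    # Header: two zero bytes, one dtype byte (0x08 = unsigned byte), one ndim byte.
    zero, dtype, ndim = struct.unpack(">HBB", raw[:4])
    if zero != 0 or dtype != 0x08:
        raise ValueError("not an unsigned-byte IDX buffer")
    # Each dimension is stored as a big-endian 32-bit unsigned integer.
    shape = struct.unpack(">" + "I" * ndim, raw[4:4 + 4 * ndim])
    data = raw[4 + 4 * ndim:]
    expected = 1
    for dim in shape:
        expected *= dim
    if len(data) != expected:
        raise ValueError("payload size does not match header")
    return shape, list(data)


def load_idx_gz(path: str):
    """Read a gzip-compressed IDX file from a local path (path is an assumption)."""
    with gzip.open(path, "rb") as handle:
        return parse_idx(handle.read())


# Synthetic stand-in for a real file: two 2x2 "images" with pixel values 0..7.
demo = struct.pack(">HBBIII", 0, 0x08, 3, 2, 2, 2) + bytes(range(8))
shape, pixels = parse_idx(demo)
print(shape, pixels)
```

Under the same assumption, a real training image file would report a shape of (60000, 28, 28), and its label file a shape of (60000,).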

Adults

  • NSFW Data Scrapper : Collection of scripts to aggregate image data for the purposes of training an NSFW Image Classifier.

OCR

  • im2latex-100k : A prebuilt dataset for OpenAI’s image-2-latex task. Includes a total of ~100k formulas and images split into train, validation and test sets.

Voice & Media & Video

Domain Data

Social Networks

  • Yahoo Instant Messenger Friends Connectivity Graph: Connections between Yahoo users who communicate with each other using Yahoo Messenger; can be used to identify key social contacts/influencers. Add dataset to cart to access.

  • SNAP: Stanford Large Network Dataset Collection

  • MLVIS: This project is the first to combine the notion of a data repository with real-time visual analytics for interactive data mining and exploratory analysis on the web.

  • Network Repository: Network repository is not only the first interactive repository, but also the largest network repository with thousands of donations.

Driving Data

LBS (Geolocation)

Time Series

  • Time Series Data Library: The Time Series Data Library (TSDL) was created by Rob Hyndman, Professor of Statistics at Monash University, Australia.

Business DataSets

Financial

  • Tushare: Provides stock trading and market quote data; simple API calls return the corresponding data in DataFrame format.

Sports

  • Football Strategy: Thousands of scenarios to make the best coaching decisions. 876KB

  • Horses for Course: Horse-racing data for predicting race results. 19MB

  • NBA & MLB Stats: Current and past season stats for teams and players, for fantasy sports predictions.

Medicines

Foods

  • Wine Quality: Chemical properties of red and white wines (separately) and quality, for classification. 3 files, 343KB total

  • malicious-urls: Hundreds of thousands of URLs, each labeled as malicious or not.

  • MovieLens: A massive collection of movie review data.

Governments

Others

Links
