Introduction to Commonly Used Datasets
Sogou Labs Data

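Sogou Labs is the external-facing window of the Sogou Search core R&D team. It has several sections, including data resources, a data mining cloud, and research collaboration. The data resources include evaluation sets, corpus data, news data, image data, and data related to natural language processing, all available from the Sogou Labs website.
Internet Corpus (SogouT)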
<doc>
<docno>page ID</docno>
<url>page URL</url>
raw page content
</doc>
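To meet different needs, the following versions are provided:
- Mini version (sample data, 61 KB): tar.gz format, zip format
- Full version (1 TB): hard-disk copy
- Historical version (130 GB): V2.0 (hard-disk copy)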
Whole-Web News Data (SogouCA)
<doc>
<url>page URL</url>
<docno>page ID</docno>
<contenttitle>page title</contenttitle>
<content>page content</content>
</doc>
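As a rough illustration of reading this layout, the sketch below pulls the four fields out of each <doc> block with a regular expression. The function name parse_sogou_docs and the sample text are invented for this example, and the sketch assumes the file has already been decoded into a Python string.

```python
import re

# One <doc>...</doc> record with the four fields in the order shown above.
DOC_PATTERN = re.compile(
    r"<doc>\s*"
    r"<url>(?P<url>.*?)</url>\s*"
    r"<docno>(?P<docno>.*?)</docno>\s*"
    r"<contenttitle>(?P<title>.*?)</contenttitle>\s*"
    r"<content>(?P<content>.*?)</content>\s*"
    r"</doc>",
    re.DOTALL,
)


def parse_sogou_docs(text):
    """Return a list of dicts, one per <doc> record found in text."""
    return [m.groupdict() for m in DOC_PATTERN.finditer(text)]


# Hypothetical sample, not taken from the real corpus.
sample = """<doc>
<url>http://example.com/a</url>
<docno>0001</docno>
<contenttitle>Title A</contenttitle>
<content>Body A</content>
</doc>
<doc>
<url>http://example.com/b</url>
<docno>0002</docno>
<contenttitle>Title B</contenttitle>
<content>Body B</content>
</doc>"""

docs = parse_sogou_docs(sample)
print(len(docs), docs[0]["title"])
```

A regular expression is used here instead of an XML parser because the records are concatenated without a single root element, so the file as a whole is not a well-formed XML document.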
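To meet different needs, the following versions are provided:
- Mini version (sample data, 101 KB): tar.gz format, zip format
- Full version (711 MB): tar.gz format, zip format
- Historical versions:
  - Full version (also available as a hard-disk copy, 1.02 GB): tar.gz format
  - Mini version (sample data, 3 KB): tar.gz format
  - Condensed version (one month of data, 437 MB): tar.gz format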
Sohu News Data (SogouCS)
<doc>
<url>page URL</url>
<docno>page ID</docno>
<contenttitle>page title</contenttitle>
<content>page content</content>
</doc>
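To meet different needs, the following versions are provided:
- Mini version (sample data, 110 KB): tar.gz format, zip format
- Full version (648 MB): tar.gz format, zip format
- Historical versions:
  - Full version (also available as a hard-disk copy, 65 GB): tar.gz format
  - Mini version (sample data, 1 KB): tar.gz format
  - Condensed version (one month of data, 347 MB): tar.gz format
  - Special version (data from Wang Canhui's WWW08 paper, 647 KB): tar.gz format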
Text Classification Evaluation (SogouTCE)
URL prefix\tcorresponding category label
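One way such a prefix table might be used is to label a URL by its longest matching prefix. The sketch below is only an illustration under that assumption: the helper names load_prefix_labels and classify_url, the sample prefixes, and the longest-prefix rule are not part of the dataset description.

```python
def load_prefix_labels(lines):
    """Build a {url_prefix: label} dict from tab-separated lines."""
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        prefix, label = line.split("\t", 1)
        table[prefix] = label
    return table


def classify_url(url, table):
    """Return the label of the longest matching prefix, or None if none match."""
    best = None
    for prefix in table:
        if url.startswith(prefix) and (best is None or len(prefix) > len(best)):
            best = prefix
    return table[best] if best is not None else None


# Hypothetical prefixes and labels, for illustration only.
table = load_prefix_labels([
    "http://sports.example.com/\tsports\n",
    "http://sports.example.com/nba/\tbasketball\n",
])
print(classify_url("http://sports.example.com/nba/game1.html", table))
```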
Internet Lexicon (SogouW)
word A  frequency  POS 1  POS 2 … POS N
word B  frequency  POS 1  POS 2 … POS N
word C  frequency  POS 1  POS 2 … POS N
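A line in this layout can be split into a word, a frequency, and a list of part-of-speech tags. The sketch below assumes the fields are separated by whitespace and that the frequency is an integer; the function name and the sample line are made up for illustration.

```python
def parse_lexicon_line(line):
    """Split 'word frequency pos1 ... posN' into (word, frequency, [pos tags])."""
    parts = line.split()
    word = parts[0]
    frequency = int(parts[1])  # assumption: the frequency is an integer count
    pos_tags = parts[2:]
    return word, frequency, pos_tags


# Hypothetical sample line, not taken from the real lexicon.
entry = parse_lexicon_line("wordA 120 N V")
print(entry)
```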
IMDB Reviews
The Internet Movie Database (abbreviated IMDb)
Sentiment140
- Polarity of the tweet
- ID of the tweet
- Date of the tweet
- Query
- Username of the tweet's author
- Text of the tweet
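The six fields listed above map naturally onto a named tuple. The sketch below assumes the data is stored as comma-separated values with quoted fields, one tweet per row; the Tweet type, the field names, and the sample row are invented for this example.

```python
import csv
import io
from collections import namedtuple

# Field order follows the list above; the names themselves are hypothetical.
Tweet = namedtuple("Tweet", ["polarity", "tweet_id", "date", "query", "user", "text"])


def read_tweets(handle):
    """Read six-column CSV rows from a file-like object into Tweet tuples."""
    return [Tweet(*row) for row in csv.reader(handle) if row]


# Made-up row for illustration; the quoted text field contains a comma.
sample = io.StringIO(
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","some_user","not a great day, sadly"\n'
)
tweets = read_tweets(sample)
print(tweets[0].polarity, tweets[0].text)
```

Using the csv module rather than a plain split on commas matters here, because tweet text can itself contain commas.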
Yelp Reviews

The dataset is distributed in several formats. In the JSON format, a review record looks like this:
{
// string, 22 character unique review id
"review_id": "zdSx_SD6obEhz9VrW9uAWA",
// string, 22 character unique user id, maps to the user in user.json
"user_id": "Ha3iJu77CxlrFm-vQRs_8g",
// string, 22 character business id, maps to business in business.json
"business_id": "tnhfDv5Il8EaGSXZGiuQGg",
// integer, star rating
"stars": 4,
// string, date formatted YYYY-MM-DD
"date": "2016-03-09",
// string, the review itself
"text": "Great place to hang out after work: the prices are decent, and the ambience is fun. It's a bit loud, but very lively. The staff is friendly, and the food is good. They have a good selection of drinks.",
// integer, number of useful votes received
"useful": 0,
// integer, number of funny votes received
"funny": 0,
// integer, number of cool votes received
"cool": 0
}
In addition, a tip record looks like this:
{
// string, text of the tip
"text": "Secret menu - fried chicken sando is da bombbbbbb Their zapatos are good too.",
// string, when the tip was written, formatted like YYYY-MM-DD
"date": "2013-09-20",
// integer, how many likes it has
"likes": 172,
// string, 22 character business id, maps to business in business.json
"business_id": "tnhfDv5Il8EaGSXZGiuQGg",
// string, 22 character unique user id, maps to the user in user.json
"user_id": "49JhAJh8vSQ-vM4Aourl0g"
}
There is also an open-source project dedicated to parsing this data.
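Records like the ones above can also be read with nothing more than the standard library. The sketch below assumes each file stores one JSON object per line and that the // comments in the samples are documentation only, since standard JSON has no comment syntax. The function name iter_records and the shortened sample lines are invented for this example.

```python
import json


def iter_records(lines):
    """Yield one dict per non-empty line, treating each line as a JSON object."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Shortened records based on the fields shown above, without the // comments.
sample_lines = [
    '{"review_id": "zdSx_SD6obEhz9VrW9uAWA", "stars": 4, "date": "2016-03-09", "useful": 0}',
    "",
    '{"review_id": "hypothetical_id_0000002", "stars": 2, "date": "2016-03-10", "useful": 1}',
]

reviews = list(iter_records(sample_lines))
average_stars = sum(r["stars"] for r in reviews) / len(reviews)
print(len(reviews), average_stars)
```

Reading line by line keeps memory use flat, which helps when a review file is too large to load in one piece.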
Enron-Spam

An example of a legitimate (ham) email:
Subject: christmas baskets the christmas baskets have been ordered . we have ordered several baskets . individual earth - sat freeze - notis smith barney group baskets rodney keys matt rodgers charlie notis jon davis move team phillip randle chris hyde harvey freese faclities
An example of a spam email:
Subject: fw : this is the solution i mentioned lscoo thank you , your email address was obtained from a purchased list ,reference # 2020 mid = 3300 . if you wish to unsubscribe from this list, please click here and enter your name into the remove box . if you have previously unsubscribed and are still receiving this message, you may email our abuse control center, or call 1 - 888 - 763 - 2497, or write us at : nospam , 6484 coral way, miami, fl, 33155 " . 2002 web credit inc . all rights reserved .
bAbI Reading Comprehension Dataset
1 Mary moved to the bathroom.
2 John went to the hallway.
3 Where is Mary? bathroom 1
4 Daniel went back to the hallway.
5 Sandra moved to the garden.
6 Where is Daniel? hallway 4
7 John moved to the office.
8 Sandra journeyed to the bathroom.
9 Where is Daniel? hallway 4
10 Mary moved to the hallway.
11 Daniel travelled to the office.
12 Where is Daniel? office 11
13 John went back to the garden.
14 John moved to the bedroom.
15 Where is Sandra? bathroom 8
1 Sandra travelled to the office.
2 Sandra went to the bathroom.
3 Where is Sandra? bathroom 2
4 Mary went to the bedroom.
5 Daniel moved to the hallway.
6 Where is Sandra? bathroom 2
7 John went to the garden.
8 John travelled to the office.
9 Where is Sandra? bathroom 2
10 Daniel journeyed to the bedroom.
11 Daniel travelled to the hallway.
12 Where is John? office 8
The abstract format is:
ID text
ID text
ID text
ID question[tab]answer[tab]supporting fact IDS.
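The format above can be turned into (context, question, answer, supporting facts) tuples with a short parser. The sketch below assumes that an ID of 1 starts a new story and that question lines are the only ones containing tabs; the function name parse_babi and the tuple layout are choices made for this example.

```python
def parse_babi(lines):
    """Parse bAbI lines into (context, question, answer, support_ids) tuples."""
    examples = []
    story = {}  # maps line ID -> statement text for the current story
    for line in lines:
        line = line.strip()
        if not line:
            continue
        idx, text = line.split(" ", 1)
        idx = int(idx)
        if idx == 1:
            story = {}  # an ID of 1 marks the start of a new story
        if "\t" in text:
            question, answer, support = text.split("\t")
            support_ids = [int(s) for s in support.split()]
            context = [story[i] for i in sorted(story)]
            examples.append((context, question.strip(), answer, support_ids))
        else:
            story[idx] = text
    return examples


sample = [
    "1 Mary moved to the bathroom.",
    "2 John went to the hallway.",
    "3 Where is Mary? \tbathroom\t1",
    "1 Sandra travelled to the office.",
    "2 Sandra went to the bathroom.",
    "3 Where is Sandra? \tbathroom\t2",
]
examples = parse_babi(sample)
print(examples[0])
```

The context attached to each question holds every statement seen so far in the same story, so a model can be trained either on the full context or only on the supporting facts.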
The project homepage is:
The data can be downloaded from:
http://www.thespermwhale.com/jaseweston/babi/tasks_1-20_v1.tar.gz
After you download and extract the archive, the files include:
qa10_indefinite-knowledge_test.txt
qa1_single-supporting-fact_test.txt
qa10_indefinite-knowledge_train.txt
qa1_single-supporting-fact_train.txt
qa11_basic-coreference_test.txt
qa20_agents-motivations_test.txt
qa11_basic-coreference_train.txt
qa20_agents-motivations_train.txt
qa12_conjunction_test.txt
qa2_two-supporting-facts_test.txt
qa12_conjunction_train.txt
qa2_two-supporting-facts_train.txt
qa13_compound-coreference_test.txt
qa3_three-supporting-facts_test.txt
qa13_compound-coreference_train.txt
qa3_three-supporting-facts_train.txt
qa14_time-reasoning_test.txt
qa4_two-arg-relations_test.txt
qa14_time-reasoning_train.txt
qa4_two-arg-relations_train.txt
qa15_basic-deduction_test.txt
qa5_three-arg-relations_test.txt
qa15_basic-deduction_train.txt
qa5_three-arg-relations_train.txt
qa16_basic-induction_test.txt
qa6_yes-no-questions_test.txt
qa16_basic-induction_train.txt
qa6_yes-no-questions_train.txt
qa17_positional-reasoning_test.txt
qa7_counting_test.txt
qa17_positional-reasoning_train.txt
qa7_counting_train.txt
qa18_size-reasoning_test.txt
qa8_lists-sets_test.txt
qa18_size-reasoning_train.txt
qa8_lists-sets_train.txt
qa19_path-finding_test.txt
qa9_simple-negation_test.txt
qa19_path-finding_train.txt
qa9_simple-negation_train.txt
SMS Spam Collection
ham What you doing?how are you?
ham Ok lar… Joking wif u oni…
ham dun say so early hor… U c already then say…
ham MY NO. IN LUTON 0125698789 RING ME IF UR AROUND! H*
ham Siva is in hostel aha:-.
ham Cos i was out shopping wif darren jus now n i called him 2 ask wat present he wan lor. Then he started guessing who i was wif n he finally guessed darren lor.
spam FreeMsg: Txt: CALL to No: 86888 & claim your reward of 3 hours talk time to use from your phone now! ubscribe6GBP/ mnth inc 3hrs 16 stop?txtStop
spam Sunshine Quiz! Win a super Sony DVD recorder if you canname the capital of Australia? Text MQUIZ to 82277. B
spam URGENT! Your Mobile No 07808726822 was awarded a L2,000 Bonus Caller Prize on 02/09/03! This is our 2nd attempt to contact YOU! Call 0871-872-9758 BOX95QU
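Each line above starts with a label (ham or spam) followed by the message text. The sketch below splits on the first run of whitespace, which works whether the separator is a tab or a space; the function name parse_sms is made up for this example.

```python
from collections import Counter


def parse_sms(lines):
    """Return (label, text) pairs from lines of the form 'label<sep>message'."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        label, text = line.split(None, 1)  # split once, on the first whitespace run
        pairs.append((label, text))
    return pairs


sample = [
    "ham What you doing?how are you?",
    "ham Siva is in hostel aha:-.",
    "spam Sunshine Quiz! Win a super Sony DVD recorder if you canname the capital of Australia? Text MQUIZ to 82277. B",
]
messages = parse_sms(sample)
label_counts = Counter(label for label, _ in messages)
print(label_counts)
```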

UBUNTU DIALOG CORPUS
The Ubuntu Dialog Corpus (UDC) is one of the largest publicly available dialogue datasets.
cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/ubuntu_dialogs.tgz
It is based on chat logs from the Ubuntu channels on a public IRC network.

Hate speech identification
- contains hate speech;
- is offensive, but contains no hate speech;
- is not offensive at all.
The download link is:
https://github.com/t-davidson/hate-speech-and-offensive-language
Twitter Progressive issues sentiment analysis
The download link is:
Toutiao News Text Classification Dataset
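The Toutiao news text classification dataset covers the following 15 categories (code, category, name, identifier):
100  Livelihood      Stories         news_story
101  Culture         Culture         news_culture
102  Entertainment   Entertainment   news_entertainment
103  Sports          Sports          news_sports
104  Finance         Finance         news_finance
106  Real estate     Real estate     news_house
107  Cars            Cars            news_car
108  Education       Education       news_edu
109  Technology      Technology      news_tech
110  Military        Military        news_military
112  Travel          Travel          news_travel
113  International   International   news_world
114  Securities      Stocks          stock
115  Agriculture     Rural affairs   news_agriculture
116  Esports         Games           news_game
The data format is: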
6552431613437805063_!_102_!_news_entertainment_!_谢娜为李浩菲澄清网络谣言,之后她的两个行为给自己加分_!_佟丽娅,网络谣言,快乐大本营,李浩菲,谢娜,观众们
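Each line is one record whose fields are separated by _!_. From first to last, the fields are the news ID, the category code, the category name, the news title, and the news keywords.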
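A record in this layout can be split on the _!_ separator and its comma-separated keywords turned into a list. The sketch below is one possible reading of the sample line; the field names in FIELDS and the function name parse_toutiao_line are assumptions made for this example.

```python
# Hypothetical field names, in the order the fields appear in the sample line.
FIELDS = ("news_id", "category_code", "category_name", "title", "keywords")


def parse_toutiao_line(line):
    """Split one '_!_'-separated record into a dict; keywords become a list."""
    parts = line.rstrip("\n").split("_!_")
    record = dict(zip(FIELDS, parts))
    # Keywords are separated by ASCII commas; a missing field gives an empty list.
    record["keywords"] = [k for k in record.get("keywords", "").split(",") if k]
    return record


line = (
    "6552431613437805063_!_102_!_news_entertainment_!_"
    "谢娜为李浩菲澄清网络谣言,之后她的两个行为给自己加分_!_"
    "佟丽娅,网络谣言,快乐大本营,李浩菲,谢娜,观众们"
)
record = parse_toutiao_line(line)
print(record["category_code"], record["category_name"], len(record["keywords"]))
```

The title uses a full-width comma while the keywords use ASCII commas, so splitting the keyword field on "," leaves the title untouched.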
The dataset is hosted on its project homepage and can also be used directly.