Corpus | Publisher | Year | Size | Unit | Quantity | Tasks
20 Newsgroups | Carnegie Mellon University | 1999 | 18 MB | documents | 20K | Text Classification, Text Clustering
7Sectors-WebKB | CMU World Wide Knowledge Base | 2001 | 6 MB | documents | 5K | Text Classification, Text Clustering
A Corpus of Plagiarised Short Answers | University of Sheffield | 2009 | 80 KB | documents | 100 | Plagiarism Detection
ABCD (Agreement By Create Debaters) | Sara Rosenthal | 2015 | 42 MB | dialogues | 10K | Conversation Analysis (written, human-human)
Annotated Customer Reviews | Simon Fraser University Burnaby | 2004 | 870 KB | - | - | Sentiment Analysis
AOL Query Log | AOL | 2006 | 2 GB | queries | 112M | Query Log Analysis
Araucaria Argumentation Corpus | University of Dundee | 2014 | 9 MB | examples | 664 | Computational Argumentation
Arguing Subjectivity Corpus | University of Pittsburgh | 2012 | 732 KB | documents | 84 | Computational Argumentation
Argument Annotated Essays, v1 | TU Darmstadt | 2014 | 7 MB | essays | 90 | Computational Argumentation
Argument Annotated Essays, v2 | TU Darmstadt | 2016 | 6 MB | essays | 402 | Computational Argumentation
Argumentative Microtext Corpus, parts 1 + 2 | Potsdam University | 2018 | 7 MB | texts | 290 | Computational Argumentation
Arxiv-PubMed Corpus | Georgetown University | 2018 | 4.2 GB | article-abstract pairs | 350K | Text Summarization, Scientific Document Summarization
AWTP (Agreement in Wikipedia Talk Pages) | Sara Rosenthal | 2012 | 235 KB | dialogues | 822 | Conversation Analysis (written, human-human)
Bergsma-Wang-Corpus 2007 | S. Bergsma and Q. I. Wang | 2007 | 2 MB | queries | 2K | Web Search Analysis
BigPatent Summarization Corpus | Khoury College of Computer Sciences | 2019 | 6 GB | article-summary pairs (US patents) | 1M | Text Summarization
BLOGS06 test collection | University of Glasgow | 2006 | - | documents | 4M | Link Analysis
BNC Writing Errors | J. Wagner et al. | 2007 | 274 MB | - | - | Writing Error Detection
British National Corpus (XML) | BNC Consortium | 2007 | 5 GB | texts | 4K | Text Analysis (English)
Brown Corpus | Brown University | 2011 | 22 MB | documents | 500 | Text Analysis (English)
CEEAUS 2010 Beta Edition | Kobe University | 2010 | - | documents | 2K | Cross-Language Analysis
Change My View Modes | Columbia University | 2017 | - | discussion threads | 78 | Computational Argumentation
CLEANEVAL 2007 | University of Trento and University of Leeds | 2007 | 15 MB | documents | 1K | Main Content Extraction
CLEF-IP 2009 | Information Retrieval Facility Society (IRF) | 2009 | 14 GB | documents | 2M | Patent Retrieval
CLEF-IP 2010 | Information Retrieval Facility Society (IRF) | 2010 | 9 GB | documents | 3M | Patent Retrieval
ClueWeb09 | Carnegie Mellon University | 2009 | 4 TB | web pages | 1B | Web Mining
ClueWeb12 | Carnegie Mellon University | 2012 | 5 TB | web pages | 733M | Web Mining
CNN-DailyMail | IBM | 2016 | 1 GB | article-summary pairs | 200K | Text Summarization
CoNLL-2003 | University of Antwerp | 2003 | 12 MB | - | - | Named Entity Recognition
CoPhIR | Consiglio Nazionale delle Ricerche (ISTI-CNR) | 2003 | 54 GB | images | 106M | Image Retrieval
CORE | The Open University | 2018 | 330 GB | documents | 123M | Data Mining
DBLP | University of Massachusetts Amherst | 2006 | 910 MB | - | - | Network Analysis
DBpedia 3.5 | DBpedia | 2010 | 8 GB | - | - | Data Mining
DMOZ | Open Directory Project | 2010 | 11 GB | - | - | Clustering, Cluster Labeling, Data Mining
DoQA | Ixa | 2020 | 4 MB | dialogues | 2437 | Conversation Analysis (written, human-human)
ECML PKDD Discovery Challenge 2008 | ECML | 2008 | 304 MB | lines | 17M | Collaborative Filtering, Spam Detection
ESL 123 Mass Noun Examples | Microsoft Corporation | 2006 | 204 KB | sentences | 123 | Cross-Language Analysis
Essay Argument Strength | UT Dallas | 2015 | 30 KB | scores | 1K | Essay Scoring
Essay Organization | UT Dallas | 2010 | 30 KB | scores | 1K | Essay Scoring
Essay Prompt Adherence | UT Dallas | 2014 | 38 KB | scores | 830 | Essay Scoring
Essay Thesis Clarity | UT Dallas | 2013 | 6 MB | scores | 830 | Essay Scoring
Europarl (v1 & v3) | University of Edinburgh | 2007 | 3 GB | - | - | Machine Translation
European Corpus Initiative Multilingual Corpus I | European Corpus Initiative | 1994 | 824 MB | words | 49M | Text Analysis (Multilingual)
Falko Essaykorpus L2 V2 | Institut für deutsche Sprache und Linguistik | 2005 | 5 MB | documents | 248 | Interlanguage Analysis
Finegrained Sentiment | Uppsala University | 2011 | 4 MB | reviews | 294 | Sentiment Analysis
General Inquirer Dictionary | Harvard University | 1966 | 4 MB | categories | 182 | Sentiment Analysis
Google Books N-Gram 20090715 | Google | 2009 | 898 GB | - | - | Data Mining
Google Web 1T 5-gram Version 1 | Google | 2006 | 55 GB | n-grams | 5B | Text Analysis (English)
IBM Debater - Claim Sentences Search | IBM | 2018 | 600 MB | topic-conclusion pairs | 2M | Argument Search
IBM Debater - Claim Stance Dataset | IBM | 2017 | 8 MB | topic-conclusion pairs | 2K | Stance Classification
IBM Debater - Claims and Evidence, ACL-14 | IBM | 2014 | 3 MB | topic-argument pairs | 1K | Argument Mining
IBM Debater - Claims and Evidence, EMNLP-2015 | IBM | 2015 | 8 MB | topic-argument pairs | 5K | Argument Mining
IBM Debater - Evidence Sentences | IBM | 2018 | 3 MB | topic-premise pairs | 6K | Argument Search
IBM Debater - IBM-ArgQ-Rank-30kArgs | IBM | 2019 | 2 MB | arguments | 30K | Argument Quality
IBM Debater - Mention Detection Benchmark | IBM | 2018 | 2 MB | sentences | 3K | Mention Detection
IBM Debater - Recorded Debating Dataset | IBM | 2018 | 2 MB | discussions | 60 | Computational Argumentation
IBM Debater - Sentiment Composition Lexicon | IBM | 2018 | 10 MB | words | 66K | Sentiment Analysis
IBM Debater - Sentiment Lexicon of Idiomatic Expressions | IBM | 2018 | 3 MB | phrases | 5K | Sentiment Analysis
IBM Debater - TR9856 | IBM | 2015 | 2 MB | phrase pairs | 10K | Semantic Relatedness
IBM Debater - Wikipedia Category Stance | IBM | 2018 | 1 MB | Wikipedia categories | 5K | Stance Classification
IBM Debater - Word | IBM | 2018 | 4 MB | Wikipedia concept pairs | 19K | Semantic Relatedness
ICWSM 2009 Data Challenge | ICWSM | 2009 | 37 GB | - | - | Network Analysis
imat2009 dataset | Yandex | 2009 | 650 MB | - | - | Machine-learned Ranking
Intelligence Squared Debates (IQ2) | Zhang et al. | 2016 | 4 MB | dialogues | 108 | Conversation Analysis (spoken, human-human)
International Corpus of Learner English v2 | Center for English Corpus Linguistics | 2009 | 92 MB | documents | 6K | Language Analysis
Internet Argument Corpus v2 | NLDS@UC Santa Cruz | 2016 | 3 GB | dialogues | 11K | Conversation Analysis (written, human-human)
IP2Location LITE databases 2016-20 | IP2Location | 2016 | 5 GB | years | 5 | IP Geolocation and Proxies
Key-value Retrieval Dataset | Stanford University | 2017 | 1 MB | dialogues | 3K | Conversation Analysis (written, human-wizard)
Koppel Authorship Corpus | M. Koppel and J. Schler | 2004 | 4 MB | - | - | Authorship Verification
Learning To Rank 3 | Microsoft | 2008 | 8 GB | - | - | Machine-learned Ranking
Lee 50 Documents | M. D. Lee et al. | 2005 | 130 KB | documents | 50 | Text Similarity Analysis
Maluuba Frames | Maluuba (Microsoft) | 2017 | 4 MB | dialogues | 1K | Conversation Analysis (written, human-wizard)
MANtIS | Lambda-Lab at TU Delft | 2019 | 6 GB | dialogues | 80K | Conversation Analysis (written, human-human)
MEDLINE-PubMed Corpus | University of Zurich | 2018 | 7 GB | article-abstract & abstract-title pairs | 5M | Text Summarization, Scientific Document Summarization
METER Corpus | Departments of Journalism and Computer Science, University of Sheffield | 2002 | 10 MB | - | - | Text Reuse
MIR Flickr 2008 | LIACS Medialab at Leiden University, Netherlands | 2008 | 3 GB | documents | 25K | Image Retrieval
MISC | Microsoft | 2017 | 23 GB | dialogues | 110 | Conversation Analysis (spoken, human-human)
Montclair Electronic Language Database | Montclair State University | 2001 | 56 KB | documents | 33 | Cross-Language Analysis
Movie Review Data | Cornell University | 2004 | 219 MB | reviews | 12K | Sentiment Analysis
MovieLens | University of Minnesota | 1998 | 74 MB | ratings | 11M | Collaborative Filtering
MPC (Multi-Party Chat) | Shaikh et al. | 2010 | 2 MB | dialogues | 14 | Conversation Analysis (written, human-human)
MS MARCO Conversational Search | Microsoft | 2019 | 1 GB | synthetic search sessions | 2M | Next Query Prediction
Multi-Domain Sentiment Dataset (Processed ACL) | Johns Hopkins University | 2007 | 29 MB | - | - | Sentiment Analysis
Multi-News | Yale University | 2019 | 676 MB | multiple articles-summary pairs | 54K | Text Summarization, Multi-document
MultiWOZ 2.1 | M. Eric et al. | 2020 | 19 MB | dialogues | 10K | Conversation Analysis (written, human-wizard)
NBC 2016 Russian Troll Tweets | NBC | 2018 | 34 MB | tweets | 267K | Propaganda Detection
Netflix Challenge (Partial) | Netflix | 2006 | 2 GB | - | - | Collaborative Filtering
New York Times Corpus | New York Times | 2008 | 3 GB | articles | 2M | Text Mining
Newsroom | Cornell University | 2018 | 5 GB | article-summary pairs | 1.3M | Text Summarization
ODP239 | C. Carpineto and G. Romano | 2009 | 5 MB | - | - | Subtopic Information Retrieval
OHSUMED Test Collection | Oregon Health & Science University | 1994 | 461 MB | - | - | Text Clustering
OpenWebText Corpus | Brown University | 2019 | 40 GB | documents | 8M | Language Modeling, Text Synthesis
OPUS (Europarl3_0b and EMEA0) | Jörg Tiedemann | 2009 | 9 GB | languages | 22 | Machine Translation
OR-QuAC | C. Qu et al. | 2020 | 10 GB | dialogues | 6K | Conversation Analysis (written, human-wizard), Question Answering
QuAC | E. Choi et al. | 2018 | 75 MB | dialogues | 14K | Conversation Analysis (written, human-wizard), Question Answering
RadioTalk | Laboratory for Social Machines, MIT Media Lab | 2019 | 9 GB | words | 3B | Language Analysis
Reason Identification and Classification Dataset | UT Dallas | 2014 | 4 MB | - | - | Computational Argumentation
Reddit TIFU Corpus | Seoul National University | 2019 | 640 MB | content-summary pairs | 123K | Text Summarization
Request For Comments Collections (to 4501) | RFC Editor | 2008 | 55 MB | documents | 4K | Data Mining
Reuters-21578 (22173) | Reuters, David D. Lewis | 1996 | 8 MB | articles | 22K | Text Clustering
Reuters RCV1 | Reuters, David D. Lewis | 2000 | 1 GB | documents | 365 | Text Clustering
Reuters RCV1 - CCAT split | Reuters, David D. Lewis | 2002 | 2 GB | - | - | Machine Learning
Reuters RCV1/RCV2 Multilingual, Multiview Text Categorization Test Collection | National Research Council of Canada | 2009 | 166 MB | - | - | Cross-Language Categorization
Rovereto Twitter N-Gram Corpus | University of Trento, Italy | 2011 | 5 GB | tweets | 75M | Social Network Analysis
ScisummNet Corpus | Yale University | 2019 | 15 MB | scientific paper-summary pairs (with citation networks) | 1000 | Text Summarization, Scientific Document Summarization
SILS Learner Corpus of English | Waseda University | 2007 | 16 MB | - | - | Cross-Language Analysis
SMS Spam Collection v | T. A. Almeida and J. M. G. Hidalgo | 2011 | 210 KB | messages | 6K | Spam Identification
Spoken Conversational Search Data Set | J. R. Trippas et al. | 2017 | 260 KB | dialogues | 101 | Conversation Analysis (written, human-human)
Spotify Podcasts Dataset | Clifton et al. | 2020 | 2 TB | hours | 50K | Conversation Analysis (spoken, human-human)
TED-LIUM Release 3 | Ubiqus and LIUM | 2018 | 50 GB | hours | 452 | Speech Recognition
The JRC-Acquis Multilingual Parallel Corpus (3) | European Commission's Office for Official Publications (OPOCE) | 2009 | 2 GB | - | - | Cross-Language Research
TIPSTER Complete | Advanced Research Projects Agency | 1993 | 1 MB | - | - | Information Retrieval
Topical Chat Dataset | Amazon | 2019 | 76 MB | dialogues | 11K | Conversation Analysis (written, human-human)
TREC vol4 | National Institute of Standards and Technology (NIST) | 1996 | 436 MB | documents | 295K | Data Mining
TREC vol5 | National Institute of Standards and Technology (NIST) | 1997 | 389 MB | documents | 260K | Data Mining
TREC web | National Institute of Standards and Technology (NIST) | 1999 | 90 GB | - | - | Data Mining
TripAdvisor Data Set | University of Illinois at Urbana-Champaign | 2010 | 220 MB | - | - | Opinion Mining
Tswana Learner English Corpus | Center for Text Technology | 2006 | 2 MB | - | - | Cross-Language Analysis
Twitter tweets | Yang and Leskovec | 2011 | 26 GB | tweets | 467M | Social Network Analysis
Twitter tweets (RecSys Challenge) | Twitter | 2020 | 76 GB | tweets | 160M | Social Network Analysis
UKPConvArg1 | TU Darmstadt | 2016 | 21 MB | argument pairs | 16K | Computational Argumentation
UKPConvArg2 | TU Darmstadt | 2016 | 23 MB | argument pairs | 9K | Computational Argumentation
Uppsala Student English | Uppsala University | 2001 | 3 MB | documents | 2K | Cross-Language Analysis
US Bill Summarization Corpus | FiscalNote Research | 2019 | 64 MB | article-summary pairs (US bills) | 22K | Text Summarization
USPTO Patents from 2001 to 2010 | U.S. Patent & Trademark Office | 2010 | 10 TB | - | - | Patent Analysis
VQuAnDa | Kacupaj et al. | 2020 | 2 MB | question-answer-SPARQL query triplets | 5K | Answer Verbalization
WaCKy: deWaC | Web-As-Corpus Kool Yinitiative | 2009 | 26 GB | words | 2B | Text Analysis (German)
WaCKy: frWaC | Web-As-Corpus Kool Yinitiative | 2009 | 5 GB | words | 2B | Text Analysis (French)
WaCKy: itWaC | Web-As-Corpus Kool Yinitiative | 2009 | 31 GB | words | 2B | Text Analysis (Italian)
WaCKy: sdeWaC | Web-As-Corpus Kool Yinitiative | 2009 | 20 GB | words | 1B | Text Analysis (German)
WaCKy: ukWaC | Web-As-Corpus Kool Yinitiative | 2009 | 15 GB | words | 2B | Text Analysis (English)
WaCKy: WaCkypedia_EN | Web-As-Corpus Kool Yinitiative | 2009 | 6 GB | words | 1B | Text Analysis (English)
WCEP MDS Dataset: Wikipedia Current Events Portal | Aylien Ltd., Dublin, Ireland | 2020 | 2 GB | document clusters with one human-written summary per cluster | 2.39M | Text Summarization, Multi-document
Web People Search Corpus (WePS-1) | NLP Group (UNED), Proteus Project (NYU) | 2007 | 295 MB | web pages | 2K | Person Disambiguation, Text Clustering
Web People Search Corpus (WePS-2) | NLP Group (UNED), Proteus Project (NYU) | 2009 | 328 MB | web pages | 3K | Person Disambiguation, Text Clustering
Web People Search Corpus (WePS-3) | NLP Group (UNED), Proteus Project (NYU) | 2010 | 571 MB | web pages | 50K | Person Disambiguation, Text Clustering
WikiHow Summarization Corpus | University of California | 2018 | 2 GB | article-summary, paragraph-summary pairs | 230K | Text Summarization
Wikipedia Full Dump | Wikimedia Foundation | 2011 | 5 TB | - | - | Data Mining
Wikipedia History Snapshots | Wikimedia Foundation | 2006 | 32 GB | - | - | Data Mining
Wikipedia Participation Challenge | Wikimedia Foundation | 2011 | 976 MB | - | - | User Behaviour Prediction
Wikipedia Revision Dump | Wikimedia Foundation | 2006 | 46 GB | - | - | Data Mining
Wikipedia Revision Dump | Wikimedia Foundation | 2008 | 133 GB | - | - | Data Mining
Wikipedia Snapshots | Wikimedia Foundation | 2006 | 280 GB | - | - | Data Mining
WordSim353 | L. Finkelstein et al. | 2002 | 60 KB | word pairs | 353 | Word Similarities
Wortschatz Leipzig | Universität Leipzig | 2006 | 8 GB | languages | 15 | Text Analysis (Multilingual)
XSum Corpus | University of Edinburgh | 2018 | 240 MB | article-summary pairs | 214K | Text Summarization
Yahoo Learning To Rank Challenge 2010 | Yahoo | 2010 | 421 MB | - | - | Document Ranking
Yahoo N-Grams | Yahoo | 2006 | 13 GB | - | - | Text Analysis (English)