This page organizes all corpora that have resulted from or have been used in our research. The data is made available to Webis-external researchers in various places: (1) corpora that have been officially released by Webis and (2) corpora of the PAN series can be downloaded here, (3) internal Webis corpora (which will be officially released in the future) are supplied upon request, (4) affiliated corpora made available by courtesy of our research partners can be downloaded here, and (5) other corpora can be downloaded from their original publisher/creator. Most of our released corpora are hosted at Zenodo and are indexed in Google Dataset Search; a few larger corpora are available in the Internet Archive. A symbol next to an entry indicates a browsing facility for the respective corpus.

PAN Corpora
Name Publisher/Creator Year Size [bytes] Size [units] Default Task Access
Alvi15-Text-Alignment-en-fa Webis Group 2015 2 MB 200 documents Originality Zenodo Google Dataset Search
C10-Attribution Webis Group 2015 4 MB None None Author Identification Zenodo Google Dataset Search
C50-Attribution Webis Group 2015 17 MB None None Author Identification Zenodo Google Dataset Search
Cheema15-Text-Alignment-en Webis Group 2015 4 MB None None Originality Zenodo Google Dataset Search
Hanfi15-Text-Alignment-en-ur Webis Group 2015 3 MB None None Originality Zenodo Google Dataset Search
Khoshnavataher15-Text-Alignment-fa Webis Group 2015 16 MB None None Originality Zenodo Google Dataset Search
Kong15-Text-Alignment-zh Webis Group 2015 3 MB None None Originality Zenodo Google Dataset Search
Mohtaij15-Text-Alignment-en Webis Group 2015 57 MB None None Originality Zenodo Google Dataset Search
Palkovskii15-Text-Alignment-en Webis Group 2015 26 MB None None Originality Zenodo Google Dataset Search
PAN-PC-09 Webis Group 2009 2 GB 41K documents Plagiarism Detection Zenodo Google Dataset Search
PAN-PC-10 Webis Group 2010 2 GB 27K documents Plagiarism Detection Zenodo Google Dataset Search
PAN-PC-11 Webis Group 2011 2 GB 27K documents Plagiarism Detection Zenodo Google Dataset Search
PAN-SemEval-Hyperpartisan-News-Detection-19 Webis & Factmata 2019 1 GB 751K articles Hyperpartisan News Detection Zenodo Google Dataset Search
PAN-WQF-12 Webis Group 2012 4 GB 2M documents Quality Flaw Prediction Zenodo Google Dataset Search
PAN-WVC-10 Webis Group 2010 439 MB 32K documents Vandalism Detection Zenodo Google Dataset Search
PAN-WVC-11 Webis Group 2011 371 MB 24K documents Vandalism Detection Zenodo Google Dataset Search
PAN11-Attribution Webis Group 2011 3 MB None None Author Identification Zenodo Google Dataset Search
PAN12-Attribution Webis Group 2012 9 MB None None Author Identification Zenodo Google Dataset Search
PAN12-Sexual-Predator-Identification Webis Group 2012 92 MB None None Deception Detection Zenodo Google Dataset Search
PAN12-Source-Retrieval Webis Group 2012 1 MB None None Originality Zenodo Google Dataset Search
PAN12-Text-Alignment Webis Group 2012 783 MB None None Originality Zenodo Google Dataset Search
PAN13-Author-Profiling Webis Group 2013 713 MB None None Author Profiling Zenodo Google Dataset Search
PAN13-Source-Retrieval Webis Group 2013 3 MB None None Originality Zenodo Google Dataset Search
PAN13-Text-Alignment Webis Group 2013 35 MB None None Originality Zenodo Google Dataset Search
PAN13-Verification Webis Group 2013 1 MB None None Author Identification Zenodo Google Dataset Search
PAN14-Author-Profiling Webis Group 2014 205 MB None None Author Profiling Zenodo Google Dataset Search
PAN14-Source-Retrieval Webis Group 2014 7 MB None None Originality Zenodo Google Dataset Search
PAN14-Text-Alignment Webis Group 2014 22 MB None None Originality Zenodo Google Dataset Search
PAN14-Verification Webis Group 2014 9 MB None None Author Identification Zenodo Google Dataset Search
PAN15-Author-Profiling Webis Group 2015 2 MB None None Author Profiling Zenodo Google Dataset Search
PAN15-Source-Retrieval Webis Group 2015 7 MB None None Originality Zenodo Google Dataset Search
PAN15-Verification Webis Group 2015 3 MB None None Author Identification Zenodo Google Dataset Search
PAN16-Author-Masking PAN 2016 2 MB 205 cases Author Obfuscation Zenodo Google Dataset Search
PAN16-Author-Profiling Webis Group 2016 2 MB None None Author Profiling Zenodo Google Dataset Search
PAN16-Clustering Webis Group 2016 3 MB None None Author Identification Zenodo Google Dataset Search
PAN17-Author-Profiling Webis Group 2017 254 MB None None Author Profiling Zenodo Google Dataset Search
PAN17-Clustering Webis Group 2017 1 MB None None Author Identification Zenodo Google Dataset Search
PAN17-Style-Change-Detection Webis Group 2017 8 MB None None Multi-Author Analysis Zenodo Google Dataset Search
PAN18-Attribution Webis Group 2018 4 MB 2K cases Author Identification Zenodo Google Dataset Search
PAN18-Author-Profiling PAN 2018 7 GB 8K cases Author Profiling Zenodo Google Dataset Search
PAN18-Style-Change-Detection Webis Group 2018 8 MB 3K cases Multi-Author Analysis Zenodo Google Dataset Search
PAN19-Attribution Webis Group 2019 13 MB None None Author Identification Zenodo Google Dataset Search
PAN19-Bots-and-Gender-Profiling Webis Group 2019 38 MB None None Author Profiling Zenodo Google Dataset Search
PAN19-Celebrity-Profiling Webis Group 2019 3 GB None None Author Profiling Zenodo Google Dataset Search
PAN19-Style-Change-Detection Webis Group 2019 10 MB None None Multi-Author Analysis Zenodo Google Dataset Search
PAN20-Authorship-Verification Webis Group 2020 838 MB None None Authorship Verification Zenodo Google Dataset Search
PAN20-Authorship-Verification (Large) Webis Group 2020 4 GB None None Authorship Verification Zenodo Google Dataset Search
PAN20-Celebrity-Profiling Webis Group 2020 7 GB None None Author Profiling Zenodo Google Dataset Search
PAN20-Profiling-Fake-News-Spreaders-in-Twitter Webis Group 2020 8 MB None None Author Profiling Zenodo Google Dataset Search
PAN20-Style-Change-Detection Webis Group 2020 98 MB None None Multi-Author Analysis Zenodo Google Dataset Search
Released Webis Corpora
Name Publisher/Creator Year Size [bytes] Size [units] Default Task Access
Arg-Microtexts Synthesis Benchmark Webis Group 2018 4 MB 260 arguments Computational Argumentation Zenodo
args.me corpus Webis Group 2019 876 MB 388K arguments Computational Argumentation Zenodo Google Dataset Search
ArguAna Counterargs Webis Group 2018 106 MB 7K arguments Computational Argumentation Zenodo
ArguAna TripAdvisor Webis Group, FG Engels 2014 283 MB 2K reviews Sentiment Analysis Zenodo
BuzzFeed-Webis Fake News Corpus 16 Webis Group 2018 5 GB 1K articles News analysis Zenodo Google Dataset Search
Genre-KI-04 Webis Group 2004 11 MB 1K documents Web Genre Analysis Zenodo Google Dataset Search
LFA-11 Webis Group, FG Engels 2011 5 MB - None Genre and Sentiment Analysis Zenodo Google Dataset Search
WDVC-15 FG Engels, Webis Group 2015 5 GB 24M revisions Vandalism Detection Zenodo Google Dataset Search
WDVC-16 FG Engels, Webis Group 2016 30 GB 83M revisions Vandalism Detection Zenodo Google Dataset Search
Webis-Ambient-15 Webis Group 2015 114 MB 6K documents Clustering/Cluster Labeling Zenodo Google Dataset Search
Webis-ArgKB-20 Webis Group 2020 1 MB 5K argumentative-relations Computational Argumentation Zenodo
Webis-ArgQuality-20 Webis Group 2020 3 MB 1K arguments Computational Argumentation Zenodo
Webis-ArgRank-17 Webis Group 2018 13 MB 18K arguments Computational Argumentation Zenodo
Webis-Argument-Framing-19 Webis Group 2019 7 MB 12K arguments Computational Argumentation and Framing Zenodo Google Dataset Search
Webis-Bias-Flipper-18 Webis Group 2018 13 MB 6K documents Natural Language Generation Zenodo Google Dataset Search
Webis-Clickbait-16 Webis Group 2016 255 MB 3K tweets Clickbait Detection Zenodo Google Dataset Search
Webis-Clickbait-17 Webis Group 2017 - 20K tweets Clickbait Detection Zenodo Google Dataset Search
Webis-CLS-10 Webis Group 2010 530 MB 800K documents Cross-Language Text Classification Zenodo Google Dataset Search
Webis-CMV-20 Webis Group 2020 3 GB - argument pairs Computational Argumentation Zenodo
Webis-CompQuestions-20 Webis Group 2020 1 MB 15K questions Comparative Question Classification Zenodo Google Dataset Search
Webis-CPC-11 Webis Group 2011 19 MB 8K paraphrases Plagiarism Detection Zenodo Google Dataset Search
Webis-Debate-16 Webis Group 2016 908 KB 27K text segments Computational Argumentation Zenodo Google Dataset Search
Webis-Editorial-Quality-18 Webis Group 2018 3 MB 1K documents Computational Argumentation Zenodo Google Dataset Search
Webis-Editorials-16 Webis Group 2016 5 MB 300 documents Computational Argumentation Zenodo Google Dataset Search
Webis-Gmane-19 Webis Group 2020 160 GB 153M emails Dialog Analysis Zenodo Google Dataset Search
Webis-KIQC-13 Webis Group 2013 1 MB 3K questions Known-Item Search Zenodo Google Dataset Search
Webis-Mnemonics-17 Webis Group 2017 log entries Password analysis Zenodo Google Dataset Search
Webis-NIL-21 Webis Group 2021 392 KB 37K log entries query identification Zenodo Google Dataset Search
Webis-ODP-10 Webis Group 2010 113 MB 5M documents Clustering/Cluster Labeling Zenodo Google Dataset Search
Webis-PC-08 Webis Group 2008 298 MB - None Plagiarism Detection Zenodo Google Dataset Search
Webis-PRA-12 Webis Group 2012 884 KB 14K company names Spelling Error Detection Zenodo Google Dataset Search
Webis-QSeC-10 Webis Group 2010 2 MB - None Query Segmentation Zenodo Google Dataset Search
Webis-QSpell-17 Webis Group 2017 1 MB - None Query Spelling Correction Zenodo Google Dataset Search
Webis-QTM-19 Webis Group 2019 2 MB 200K Queries Query-task mapping Zenodo Google Dataset Search
Webis-Revenue-10 FG Engels, Webis Group 2010 6 MB 1K documents Entity and Relation Extraction Zenodo Google Dataset Search
Webis-SDMbridge-12 Webis Group 2012 58 MB 15K models Simulation Data Mining Zenodo Google Dataset Search
Webis-Sentences-17 Webis Group 2017 200 GB 3B sentences Text statistics Zenodo Google Dataset Search
Webis-SMC-12 Webis Group 2012 123 KB - None Search Mission Detection Zenodo Google Dataset Search
Webis-Snippet-20 Webis Group 2018 11 GB 10M snippet-webpage pairs Abstractive Snippet Generation Zenodo Google Dataset Search
Webis-TLDR-17 Webis Group 2017 2 GB 4M content-summary pairs Text Summarization Zenodo Google Dataset Search
Webis-TRC-12 Webis Group 2012 120 MB 150 interaction logs Text Reuse Detection, Paraphrasing, and Exploratory Search Zenodo Google Dataset Search
Webis-Tripad-13-Sentiment Webis Group 2013 3 MB 2K reviews Sentiment Analysis Zenodo Google Dataset Search
Webis-Tripad-14 Webis Group 2014 61 MB 266K reviews Sentiment Analysis and Author Profiling Zenodo Google Dataset Search
Webis-Voice-based-and-Conversational-Argument-Search-20 Webis Group 2020 350 KB 500 participants Conversational Analysis (spoken) Zenodo Google Dataset Search
Webis-Web-Archive-17 Webis Group 2017 94 GB 1M documents Web Analysis Zenodo Google Dataset Search
Webis-WikiDebate-18 Webis Group 2018 78 MB 6M discussions Computational Argumentation Zenodo Google Dataset Search
Webis-WikiDiscussions-18 Webis Group 2018 4 GB 6M discussions Computational Argumentation Zenodo Google Dataset Search
Webis-Wikipedia-Text-Reuse-18 Webis Group 2018 - - text segments Text Reuse Analysis Zenodo Google Dataset Search
Webis-WVC-07 Webis Group 2007 12 KB 1K documents Vandalism Detection Zenodo Google Dataset Search
Internal Webis Corpora
Name Publisher/Creator Year Size [bytes] Size [units] Default Task Access
Arxiv Webis Group - 674 MB 550 documents -
Bauphysik Webis Group 2010 70 MB - None Vertical Search
Converter Testfiles Webis Group - 2 GB - None -
Genre Corpus (2008) Webis Group 2008 26 MB 2K documents Web Genre Analysis
German Newsgroups Webis Group - 54 MB 27K documents Cluster Analysis
Google News Crawl Webis Group - 404 MB 35K documents -
Gutenberg Wordcount Webis Group - 4 MB - None -
Netspeak Dictionary Webis Group - 3 GB - None -
ODP Cluster Labeling Webis Group 2010 - 6K documents Cluster Labeling
Slashdot Webis Group - 3 GB - None -
TLDP Crawl Webis Group - 366 MB 15K documents -
Twitter Movie Sentiments Webis Group 2010 1 GB - None Sentiment Analysis
Webdiversity Webis Group - 225 MB - None -
Webis-CSP-15 Webis Group 2015 90 GB 30K documents Clustering/Cluster Labeling
Wikipedia Editwars Webis Group 2008 919 MB - None Editwar Detection
Yandex Question Queries Webis Group 2012 200 GB 2B queries -
Youtube Comments Webis Group - 2 GB 324K documents -
Affiliated Corpora
Name Publisher/Creator Year Size [bytes] Size [units] Default Task Access
Burrows Authorship Corpora Steven Burrows, RMIT University 2010 8 MB - None Source Code Authorship Attribution
CompArg Comparative Sentences 2019 Universität Hamburg 2019 3 MB - None Comparative Sentences Classification Zenodo Google Dataset Search
Dagstuhl-15512-ArgQuality Dagstuhl-15512 Quality breakout group 2018 1 MB 304 arguments Computational Argumentation Zenodo
Paderborn Genre Analysis Corpus 2012 Baumann, Lettmann, Stein 2012 20 MB - None Web Genre Analysis Zenodo Google Dataset Search
Scientific Author's Writing Style Corpus 2017 Rexha, Kröll, Ziak, Kern 2017 - 66 cases Authorship Attribution Zenodo Google Dataset Search
Other Corpora
Name Publisher/Creator Year Size [bytes] Size [units] Default Task Access
20 Newsgroups Carnegie Mellon University 1999 18 MB documents 20K Text Classification, Text Clustering
7Sectors-WebKB CMU World Wide Knowledge Base 2001 6 MB documents 5K Text Classification, Text Clustering
A Corpus of Plagiarised Short Answers University of Sheffield 2009 80 KB documents 100 Plagiarism Detection
ABCD (Agreement By Create Debaters) Sara Rosenthal 2015 42 MB dialogues 10K Conversation Analysis (written, human-human)
Annotated Customer Reviews Simon Fraser University Burnaby 2004 870 KB None - Sentiment Analysis
AOL Query Log AOL 2006 2 GB queries 112M Query Log Analysis
Araucaria Argumentation Corpus University of Dundee 2014 9 MB examples 664 Computational Argumentation
Arguing Subjectivity Corpus University of Pittsburgh 2012 732 KB documents 84 Computational Argumentation
Argument Annotated Essays, v1 TU Darmstadt 2014 7 MB essays 90 Computational Argumentation
Argument Annotated Essays, v2 TU Darmstadt 2016 6 MB essays 402 Computational Argumentation
Argumentative Microtext Corpus, parts 1 + 2 Potsdam University 2018 7 MB texts 290 Computational Argumentation
Arxiv-PubMed Corpus Georgetown University 2018 4.2 GB article-abstract pairs 350K Text Summarization, Scientific Document Summarization
AWTP (Agreement in Wikipedia Talk Pages) Sara Rosenthal 2012 235 KB dialogues 822 Conversation Analysis (written, human-human)
Bergsma-Wang-Corpus 2007 S. Bergsma and Q. I. Wang 2007 2 MB queries 2K Web Search Analysis
BigPatent Summarization Corpus Khoury College of Computer Sciences 2019 6 GB article-summary pairs (US patents) 1M Text Summarization
BLOGS06 test collection University of Glasgow 2006 - documents 4M Link Analysis
BNC Writing Errors J. Wagner et al. 2007 274 MB None - Writing Error Detection
British National Corpus (XML) BNC Consortium 2007 5 GB texts 4K Text Analysis (English)
Brown Corpus Brown University 2011 22 MB documents 500 Text Analysis (English)
CEEAUS 2010 Beta Edition Kobe University 2010 - documents 2K Cross-Language Analysis
Change My View Modes Columbia University 2017 - discussion threads 78 Computational Argumentation
CLEANEVAL 2007 University of Trento and University of Leeds 2007 15 MB documents 1K Main Content Extraction
CLEF-IP 2009 Information Retrieval Facility Society (IRF) 2009 14 GB documents 2M Patent Retrieval
CLEF-IP 2010 Information Retrieval Facility Society (IRF) 2010 9 GB documents 3M Patent Retrieval
ClueWeb09 Carnegie Mellon University 2009 4 TB web pages 1B Web Mining
ClueWeb12 Carnegie Mellon University 2012 5 TB web pages 733M Web Mining
CNN-DailyMail IBM 2016 1 GB article-summary pairs 200K Text Summarization
CoNLL-2003 University of Antwerp 2003 12 MB None - Named Entity Recognition
CoPhIR Consiglio Nazionale delle Ricerche (ISTI-CNR) 2003 54 GB images 106M Image Retrieval
CORE The Open University 2018 330 GB documents 123M Data Mining
DBLP University of Massachusetts Amherst 2006 910 MB None - Network Analysis
Dbpedia 3.5 DBpedia 2010 8 GB None - Data Mining
DMOZ Open Directory Project 2010 11 GB None - Clustering/Cluster Labeling and Data Mining
DoQA Ixa 2020 4 MB dialogues 2437 Conversation Analysis (written, human-human)
ECML PKDD Discovery Challenge 2008 ECML 2008 304 MB lines 17M Collaborative Filtering and Spam Detection
ESL 123 Mass Noun Examples Microsoft Corporation 2006 204 KB sentences 123 Cross-Language Analysis
Essay Argument Strength UT Dallas 2015 30 KB scores 1K Essay scoring
Essay Organization UT Dallas 2010 30 KB scores 1K Essay scoring
Essay Prompt Adherence UT Dallas 2014 38 KB scores 830 Essay scoring
Essay Thesis Clarity UT Dallas 2013 6 MB scores 830 Essay scoring
Europarl (v1 & v3) University of Edinburgh 2007 3 GB None - Machine Translation
European Corpus Initiative Multilingual Corpus I European Corpus Initiative 1994 824 MB words 49M Text Analysis (Multilingual)
Falko Essaykorpus L2 V2 Institut für deutsche Sprache und Linguistik 2005 5 MB documents 248 Interlanguage Analysis
Finegrained Sentiment Uppsala University 2011 4 MB reviews 294 Sentiment Analysis
General Inquirer Dictionary Harvard University 1966 4 MB categories 182 Sentiment Analysis
Google Books N-Gram 20090715 Google 2009 898 GB None - Data Mining
Google Web 1T 5-gram Version 1 Google 2006 55 GB n-grams 5B Text Analysis (English)
IBM Debater - Claim Sentences Search IBM 2018 600 MB topic conclusion pairs 2M Argument Search
IBM Debater - Claim Stance Dataset IBM 2017 8 MB topic conclusion 2K Stance Classification
IBM Debater - Claims and Evidence, ACL-14 IBM 2014 3 MB topic argument pairs 1K Argument Mining
IBM Debater - Claims and Evidence, EMNLP-2015 IBM 2015 8 MB topic argument pairs 5K Argument Mining
IBM Debater - Evidence Sentences IBM 2018 3 MB topic premise pairs 6K Argument Search
IBM Debater - IBM-ArgQ-Rank-30kArgs IBM 2019 2 MB arguments 30K Argument Quality
IBM Debater - Mention Detection Benchmark IBM 2018 2 MB sentences 3K Mention Detection
IBM Debater - Recorded Debating Dataset IBM 2018 2 MB discussions 60 Computational Argumentation
IBM Debater - Sentiment Composition Lexicon IBM 2018 10 MB words 66K Sentiment Analysis
IBM Debater - Sentiment Lexicon of Idiomatic Expressions IBM 2018 3 MB phrases 5K Sentiment Analysis
IBM Debater - TR9856 IBM 2015 2 MB phrase pairs 10K Semantic Relatedness
IBM Debater - Wikipedia Category Stance IBM 2018 1 MB wikipedia category 5K Stance Classification
IBM Debater - Word IBM 2018 4 MB wikipedia concept pairs 19K Semantic Relatedness
ICWSM 2009 Data Challenge ICWSM 2009 37 GB None - Network Analysis
imat2009 dataset Yandex 2009 650 MB None - Machine-learned Ranking
Intelligence Squared Debates (IQ2) Zhang et al. 2016 4 MB dialogues 108 Conversation Analysis (spoken, human-human)
International Corpus of Learner English v2 Center for English Corpus Linguistics 2009 92 MB documents 6K Language Analysis
Internet Argument Corpus v2 NLDS@UC Santa Cruz 2016 3 GB dialogues 11K Conversation Analysis (written, human-human)
IP2Location LITE databases 2016-20 IP2Location 2016 5 GB years 5 IP-geolocation and proxies
Key-value Retrieval Dataset Stanford University 2017 1 MB dialogues 3K Conversation Analysis (written, human-wizard)
Koppel Authorship Corpus M. Koppel and J. Schler 2004 4 MB None - Authorship Verification
Learning To Rank 3 Microsoft 2008 8 GB None - Machine-learned Ranking
Lee 50 Documents M. D. Lee et al. 2005 130 KB documents 50 Text Similarity Analysis
Maluuba Frames Maluuba (Microsoft) 2017 4 MB dialogues 1K Conversation Analysis (written, human-wizard)
MANtIS Lambda-Lab at TU Delft 2019 6 GB dialogues 80K Conversation Analysis (written, human-human)
MEDLINE-PubMed Corpus University of Zurich 2018 7 GB article-abstract & abstract-title pairs 5M Text Summarization, Scientific Document Summarization
METER Corpus Department of Journalism and Department of Computer Science at Sheffield University 2002 10 MB None - Text Reuse
MIR Flickr 2008 LIACS Medialab at Leiden University, Netherlands 2008 3 GB documents 25K Image Retrieval
MISC Microsoft 2017 23 GB dialogues 110 Conversation Analysis (spoken, human-human)
Montclair Electronic Language Database Montclair State University 2001 56 KB documents 33 Cross-Language Analysis
Movie Review Data Cornell University 2004 219 MB reviews 12K Sentiment Analysis
Movielens University of Minnesota 1998 74 MB ratings 11M Collaborative Filtering
MPC (Multi-Party Chat) Shaikh et al. 2010 2 MB dialogues 14 Conversation Analysis (written, human-human)
MSMARCO Conversational Search Microsoft 2019 1 GB synthetic search sessions 2M Next Query Prediction
Multi Domain Sentiment Dataset (Processed ACL) Johns Hopkins University 2007 29 MB None - Sentiment Analysis
Multi-News Yale University 2019 676 MB multiple articles-summary pairs 54K Text Summarization, Multi-document
MultiWOZ 2.1 M. Eric et al. 2020 19 MB dialogues 10K Conversation Analysis (written, human-wizard)
NBC 2016 Russian Troll Tweets NBC 2018 34 MB tweets 267K Propaganda detection
Netflix Challenge (Partial) Netflix 2006 2 GB None - Collaborative Filtering
New York Times Corpus New York Times 2008 3 GB articles 2M Text Mining
Newsroom Cornell University 2018 5 GB article-summary pairs 1.3M Text Summarization
ODP239 C. Carpineto and G. Romano 2009 5 MB None - Subtopic Information Retrieval
OHSUMED Test Collection Oregon Health & Science University 1994 461 MB None - Text Clustering
OpenWebText Corpus Brown University 2019 40 GB documents 8M Language Modeling, Text Synthesis
OPUS (Europarl3_0b and EMEA0) Jörg Tiedemann 2009 9 GB languages 22 Machine Translation
OR-QuAC C. Qu et al. 2020 10 GB dialogues 6K Conversation Analysis (written, human-wizard), Question Answering
QuAC E. Choi et al. 2018 75 MB dialogues 14K Conversation Analysis (written, human-wizard), Question Answering
RadioTalk Laboratory for Social Machines, MIT Media Lab 2019 9 GB words 3B Language Analysis
Reason Identification and Classification Dataset UT Dallas 2014 4 MB None - Computational Argumentation
Reddit TIFU corpus Seoul National University 2019 640 MB content-summary pairs 123K Text Summarization
Request For Comments Collections (to 4501) RFC Editor 2008 55 MB documents 4K Data Mining
Reuters 21578 (22173) Reuters, David D. Lewis 1996 8 MB articles 22K Text Clustering
Reuters RCV1 Reuters, David D. Lewis 2000 1 GB documents 365 Text Clustering
Reuters RCV1 - CCAT split Reuters, David D. Lewis 2002 2 GB None - Machine Learning
Reuters RCV1/RCV2 Multilingual, Multiview Text Categorization Test Collection National Research Council of Canada 2009 166 MB None - Cross-Language Categorization
Rovereto Twitter N-Gram Corpus University of Trento, Italy 2011 5 GB tweets 75M Social Network Analysis
ScisummNet Corpus Yale University 2019 15 MB scientific paper-summary pairs (with citation networks) 1000 Text Summarization, Scientific Document Summarization
SILS Learner Corpus of English Waseda University 2007 16 MB None - Cross-Language Analysis
SMS Spam Collection v T. A. Almeida and J. M. G. Hidalgo 2011 210 KB messages 6K Spam Identification
Spoken Conversational Search Data Set J.R. Trippas et al. 2017 260 KB dialogues 101 Conversation Analysis (written, human-human)
Spotify Podcasts Dataset Clifton et al. 2020 2 TB hours 50K Conversation Analysis (spoken, human-human)
TED-LIUM Release 3 Ubiqus and LIUM 2018 50 GB hours 452 Speech Recognition
The JRC-Acquis Multilingual Parallel Corpus (3) European Commission's Office for Official Publications (OPOCE) 2009 2 GB None - Cross-Language Research
TIPSTER Complete Advanced Research Projects Agency 1993 1 MB None - Information Retrieval
Topical Chat Dataset Amazon 2019 76 MB dialogues 11K Conversation Analysis (written, human-human)
TREC vol4 National Institute of Standards and Technology (NIST) 1996 436 MB documents 295K Data Mining
TREC vol5 National Institute of Standards and Technology (NIST) 1997 389 MB documents 260K Data Mining
TREC web National Institute of Standards and Technology (NIST) 1999 90 GB None - Data Mining
TripAdvisor Data Set University of Illinois at Urbana-Champaign 2010 220 MB None - Opinion Mining
Tswana Learner English Corpus Center for Text Technology 2006 2 MB None - Cross-Language Analysis
Twitter tweets Yang and Leskovec 2011 26 GB tweets 467M Social Network Analysis
Twitter tweets (RecSys Challenge) Twitter 2020 76 GB tweets 160M Social Network Analysis
UKPConvArg1 TU Darmstadt 2016 21 MB argument pairs 16K Computational Argumentation
UKPConvArg2 TU Darmstadt 2016 23 MB argument pairs 9K Computational Argumentation
Uppsala Student English Uppsala University 2001 3 MB documents 2K Cross-Language Analysis
US Bill Summarization Corpus FiscalNote Research 2019 64 MB article-summary pairs (US bills) 22K Text Summarization
USPTO Patents from 2001 to 2010 U.S. Patent & Trademark Office 2010 10 TB None - Patent Analysis
VQuAnDa Kacupaj et al. 2020 2 MB question-answer-SPARQL query triplets 5K Answer Verbalization
WaCKy: deWaC Web-As-Corpus Kool Yinitiative 2009 26 GB words 2B Text Analysis (German)
WaCKy: frWaC Web-As-Corpus Kool Yinitiative 2009 5 GB words 2B Text Analysis (French)
WaCKy: itWaC Web-As-Corpus Kool Yinitiative 2009 31 GB words 2B Text Analysis (Italian)
WaCKy: sdeWaC Web-As-Corpus Kool Yinitiative 2009 20 GB words 1B Text Analysis (German)
WaCKy: ukWaC Web-As-Corpus Kool Yinitiative 2009 15 GB words 2B Text Analysis (English)
WaCKy: WaCkypedia_EN Web-As-Corpus Kool Yinitiative 2009 6 GB words 1B Text Analysis (English)
WCEP MDS Dataset: Wikipedia Current Events Portal Aylien Ltd., Dublin, Ireland 2020 2 GB document clusters with one human-written summary per cluster 2.39M Text Summarization, Multi-document
Web People Search Corpus (WePS-1) NLP Group (UNED), Proteus Project (NYU) 2007 295 MB web pages 2K Person Disambiguation, Text Clustering
Web People Search Corpus (WePS-2) NLP Group (UNED), Proteus Project (NYU) 2009 328 MB web pages 3K Person Disambiguation, Text Clustering
Web People Search Corpus (WePS-3) NLP Group (UNED), Proteus Project (NYU) 2010 571 MB web pages 50K Person Disambiguation, Text Clustering
WikiHow Summarization Corpus University of California 2018 2 GB article-summary, paragraph-summary pairs 230K Text Summarization
Wikipedia Full Dump Wikimedia Foundation 2011 5 TB None - Data Mining
Wikipedia History Snapshots Wikimedia Foundation 2006 32 GB None - Data Mining
Wikipedia Participation Challenge Wikimedia Foundation 2011 976 MB None - User Behaviour Prediction
Wikipedia Revision Dump Wikimedia Foundation 2006 46 GB None - Data Mining
Wikipedia Revision Dump Wikimedia Foundation 2008 133 GB None - Data Mining
Wikipedia Snapshots Wikimedia Foundation 2006 280 GB None - Data Mining
Wordsim353 L. Finkelstein et al. 2002 60 KB word pairs 353 Word Similarities
Wortschatz Leipzig Universität Leipzig 2006 8 GB languages 15 Text Analysis (Multilingual)
XSum Corpus University of Edinburgh 2018 240 MB article-summary pairs 214K Text Summarization
Yahoo Learning To Rank Challenge 2010 Yahoo 2010 421 MB None - Document Ranking
Yahoo N-Grams Yahoo 2006 13 GB None - Text Analysis (English)