This page lists all corpora that have resulted from or have been used in our research. The data is made available to Webis-external researchers in various ways: (1) corpora officially released by Webis, as well as corpora of (2) the PAN and (3) the Touché series, can be downloaded here; (4) internal Webis corpora (which will be officially released in the future) are supplied upon request; (5) affiliated corpora made available by courtesy of our research partners can be downloaded here; (6) other corpora can be downloaded from their original publisher/creator. Most of our released corpora are hosted at Zenodo and indexed in Google Dataset Search; a few larger corpora are available in the Internet Archive. A dedicated symbol indicates a browsing facility for the respective corpus.
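The "Size [bytes]" column in the tables on this page gives human-readable values such as "876 MB" or "90.6 TB", and "-" where no size is known. Before requesting or downloading several corpora, it can help to estimate the total disk space needed. The following is a minimal sketch of one way to do that; the helper names `parse_size` and `total_bytes` are hypothetical, and the decimal unit multipliers (1 MB = 10^6 bytes) are an assumption, since the page does not state whether decimal or binary units are meant.

```python
import re

# Assumption: decimal units. The page does not say whether "MB" means 10^6 or 2^20 bytes.
_UNITS = {"KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12, "PB": 10**15}

# Matches values like "4 MB", "1.8 GB", or "90.6 TB".
_SIZE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(KB|MB|GB|TB|PB)\s*$")


def parse_size(text):
    """Return the approximate size in bytes, or None for '-' and other unparsable values."""
    match = _SIZE_RE.match(text)
    if match is None:
        return None
    number, unit = match.groups()
    return int(round(float(number) * _UNITS[unit]))


def total_bytes(sizes):
    """Sum all parsable size strings, skipping entries with unknown size."""
    return sum(b for b in map(parse_size, sizes) if b is not None)


if __name__ == "__main__":
    # Size strings copied from three rows of the released-corpora table.
    wanted = ["876 MB", "1.8 GB", "-"]
    print(total_bytes(wanted))
```

The listed sizes are rounded, so the result is only a rough lower bound; unpacked archives may need considerably more space.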

Released Webis Corpora
Name Publisher/Creator Year Size [bytes] Size [units] Default Task Access
Arg-Microtexts Synthesis Benchmark Webis group 2018 4 MB 260 arguments Computational Argumentation Zenodo
args.me corpus Webis group 2019 876 MB 388K arguments Computational Argumentation Zenodo Google Dataset Search
ArguAna Counterargs Webis group 2018 106 MB 7K arguments Computational Argumentation Zenodo
ArguAna TripAdvisor Webis group & FG Engels 2014 283 MB 2K reviews Sentiment Analysis Zenodo
BuzzFeed-Webis Fake News Corpus 16 Webis group 2018 5 GB 1K articles News analysis Zenodo Google Dataset Search
CauseNet-20 Webis group & Data Science Group 2020 1.8 GB 11.6M relations Causal Relation Analysis Zenodo
Genre-KI-04 Webis group 2004 11 MB 1K documents Web Genre Analysis Zenodo Google Dataset Search
LFA-11 Webis group & FG Engels 2011 5 MB - Genre and Sentiment Analysis Zenodo Google Dataset Search
WDVC-15 FG Engels & Webis group 2015 5 GB 24M revisions Vandalism Detection Zenodo Google Dataset Search
WDVC-16 FG Engels & Webis group 2016 30 GB 83M revisions Vandalism Detection Zenodo Google Dataset Search
Webis-Ambient-15 Webis group 2015 114 MB 6K documents Clustering/Cluster Labeling Zenodo Google Dataset Search
Webis-ArgImages-21 Webis group 2021 1 MB 3K images Computational Argumentation Zenodo Google Dataset Search
Webis-ArgKB-20 Webis group 2020 1 MB 5K argumentative relations Computational Argumentation Zenodo
Webis-ArgQuality-20 Webis group 2020 3 MB 1K arguments Computational Argumentation Zenodo
Webis-ArgRank-17 Webis group 2017 13 MB 18K arguments Computational Argumentation Zenodo
Webis-Argument-Attributes Webis group & DRL Potsdam 2020 1 KB 20 attributes Computational Argumentation
Webis-Argument-Framing-19 Webis group 2019 7 MB 12K arguments Computational Argumentation and Framing Zenodo Google Dataset Search
Webis-ArgValues-22 Webis group 2022 1 MB 5K arguments Computational Argumentation Zenodo Google Dataset Search
Webis-Bias-Flipper-18 Webis group 2018 13 MB 6K documents Natural Language Generation Zenodo Google Dataset Search
Webis-CausalQA-22 Webis group 2022 440 MB 1.1M question-answer pairs Causal Question Answering
Webis-Clickbait-16 Webis group 2016 255 MB 3K tweets Clickbait Detection Zenodo Google Dataset Search
Webis-Clickbait-17 Webis group 2017 - 20K tweets Clickbait Detection Zenodo Google Dataset Search
Webis-Clickbait-22 Webis group 2022 10 MB 5K posts Clickbait Spoiling Zenodo Google Dataset Search
Webis-CLS-10 Webis group 2010 530 MB 800K documents Cross-Language Text Classification Zenodo Google Dataset Search
Webis-CMV-20 Webis group 2020 3 GB - argument pairs Computational Argumentation Zenodo
Webis-CompQuestions-20 Webis group 2020 1 MB 15K questions Comparative Question Classification Zenodo Google Dataset Search
Webis-CompQuestions-22 Webis group 2022 5 MB 31K questions Comparative Question Classification
Webis-ConcluGen-21 Webis group 2021 225 MB 136K argument-conclusion pairs Informative Conclusion Generation, Text Summarization Zenodo Google Dataset Search
Webis-Conversational-Query-Reformulations-21 Webis group 2021 193 KB 3K messages Query classification Zenodo Google Dataset Search
Webis Chatnoir-Copycat 2021 Webis group 2021 90.6 TB 6.7 B documents Duplicate Detection
Webis-CPC-11 Webis group 2011 19 MB 8K paraphrases Plagiarism Detection Zenodo Google Dataset Search
Webis-Debate-16 Webis group 2016 908 KB 27K text segments Computational Argumentation Zenodo Google Dataset Search
Webis-Editorial-Quality-18 Webis group 2018 3 MB 1K documents Computational Argumentation Zenodo Google Dataset Search
Webis-Editorials-16 Webis group 2016 5 MB 300 documents Computational Argumentation Zenodo Google Dataset Search
Webis-EditorialSum-20 Webis group 2020 10 MB 1330 editorials Text Summarization Zenodo Google Dataset Search
Webis-Exhibition-Questions-21 Webis group 2021 34 MB 849 questions Conversational Analysis (written) Zenodo Google Dataset Search
Webis-Gmane-19 Webis group 2019 160 GB 153M emails Dialog Analysis Zenodo Google Dataset Search Internet Archive
Webis-KIQC-13 Webis group 2013 1 MB 3K questions Known-Item Search Zenodo Google Dataset Search
Webis-Mnemonics-17 Webis group 2017 2 MB 1K mnemonics Password analysis Zenodo Google Dataset Search
Webis MS MARCO Anchor Text 2022 Webis group 2022 3.5 GB 6.5 M documents Anchor Text Zenodo
Webis-NIL-21 Webis group 2021 392 KB 37K log entries Query identification Zenodo Google Dataset Search
Webis-ODP-10 Webis group 2010 113 MB 5M documents Clustering/Cluster Labeling Zenodo Google Dataset Search
Webis-PC-08 Webis group 2008 298 MB - Plagiarism Detection Zenodo Google Dataset Search
Webis-PRA-12 Webis group 2012 884 KB 14K company names Spelling Error Detection Zenodo Google Dataset Search
Webis-QInC-22 Webis group 2022 79 MB 13 MB queries Query Interpretation Zenodo Google Dataset Search
Webis-QSeC-10 Webis group 2010 2 MB - Query Segmentation Zenodo Google Dataset Search
Webis-QSpell-17 Webis group 2017 1 MB - Query Spelling Correction Zenodo Google Dataset Search
Webis-QTM-19 Webis group 2019 2 MB 200K queries Query-task mapping Zenodo Google Dataset Search
Webis-Revenue-10 FG Engels & Webis group 2010 6 MB 1K documents Entity and Relation Extraction Zenodo Google Dataset Search
Webis-SameSentiment-21 Webis group 2021 43 MB 704K sentiment pair ids Sentiment Analysis Zenodo
Webis-SameSide-19 Webis group 2020 63 MB 125K argument pairs Computational Argumentation Zenodo
Webis-SameSide-21 Webis group 2021 150 MB - argument pairs Computational Argumentation Zenodo
Webis-SameSideAdversarial-21 Webis group 2021 50 KB 175 argument pairs Computational Argumentation Zenodo
Webis-SCSmeta-21 Webis group 2021 25 KB 1K turns Conversational Analysis (spoken) Zenodo Google Dataset Search
Webis-SDMbridge-12 Webis group 2012 58 MB 15K models Simulation Data Mining Zenodo Google Dataset Search
Webis-Sentences-17 Webis group 2017 200 GB 3B sentences Text statistics Zenodo Google Dataset Search
Webis-SMC-12 Webis group 2012 123 KB - Search Mission Detection Zenodo Google Dataset Search
Webis-Snippet-20 Webis group 2020 11 GB 10M snippet-webpage pairs Abstractive Snippet Generation, Text Summarization Zenodo Google Dataset Search
Webis-TLDR-17 Webis group 2017 2 GB 4M content-summary pairs Text Summarization Zenodo Google Dataset Search
Webis-TRC-12 Webis group 2012 120 MB 150 interaction logs Text Reuse Detection, Paraphrasing, and Exploratory Search Zenodo Google Dataset Search
Webis-Tripad-13-Sentiment Webis group 2013 3 MB 2K reviews Sentiment Analysis Zenodo Google Dataset Search
Webis-Tripad-14 Webis group 2014 61 MB 266K reviews Sentiment Analysis and Author Profiling Zenodo Google Dataset Search
Webis-Voice-based-and-Conversational-Argument-Search-20 Webis group 2020 350 KB 500 participants Conversational Analysis (spoken) Zenodo Google Dataset Search
Webis-Web-Archive-17 Webis group 2017 94 GB 10K documents Web Analysis Zenodo Google Dataset Search
Webis-Web-Archive-Quality-22 Webis group 2012 18 GB 7K documents Web Analysis Zenodo Google Dataset Search
Webis-Web-Errors-19 Webis group 2019 1 MB 10K documents Web Analysis Zenodo Google Dataset Search
Webis-WebSeg-20 Webis group 2020 12 GB 8K documents Web Page Segmentation Zenodo Google Dataset Search
Webis-WebSeg-20-Algorithm-Segmentations Webis group 2021 7 GB 246K segmentations Web Page Segmentation Zenodo Google Dataset Search
Webis-WikiDebate-18 Webis group 2018 78 MB 6M discussions Computational Argumentation Zenodo Google Dataset Search
Webis-WikiDiscussions-18 Webis group 2018 4 GB 6M discussions Computational Argumentation Zenodo Google Dataset Search
Webis-Wikipedia-Text-Reuse-18 Webis group 2018 - - text segments Text Reuse Analysis Zenodo Google Dataset Search
Webis-WVC-07 Webis group 2007 12 KB 1K documents Vandalism Detection Zenodo Google Dataset Search
Webis-YouTube8MA-18 Webis group 2018 169 GB 6M documents Video Retrieval Zenodo Google Dataset Search
PAN Corpora
Name Publisher/Creator Year Size [bytes] Size [units] Default Task Access
Alvi15-Text-Alignment-en-fa 2015 2 MB 200 documents Originality Zenodo Google Dataset Search
C10-Attribution 2015 4 MB Author Identification Zenodo Google Dataset Search
C50-Attribution 2015 17 MB Author Identification Zenodo Google Dataset Search
Cheema15-Text-Alignment-en 2015 4 MB Originality Zenodo Google Dataset Search
Hanfi15-Text-Alignment-en-ur 2015 3 MB Originality Zenodo Google Dataset Search
Khoshnavataher15-Text-Alignment-fa 2015 16 MB Originality Zenodo Google Dataset Search
Kong15-Text-Alignment-zh 2015 3 MB Originality Zenodo Google Dataset Search
Mohtaij15-Text-Alignment-en 2015 57 MB Originality Zenodo Google Dataset Search
Palkovskii15-Text-Alignment-en 2015 26 MB Originality Zenodo Google Dataset Search
PAN-PC-09 Webis group 2009 2 GB 41K documents Plagiarism Detection Zenodo Google Dataset Search
PAN-PC-10 Webis group 2010 2 GB 27K documents Plagiarism Detection Zenodo Google Dataset Search
PAN-PC-11 Webis group 2011 2 GB 27K documents Plagiarism Detection Zenodo Google Dataset Search
PAN-SemEval-Hyperpartisan-News-Detection-19 Webis & Factmata 2018 1 GB 751K articles Hyperpartisan News Detection Zenodo Google Dataset Search
PAN-WQF-12 Webis group 2012 4 GB 2M documents Quality Flaw Prediction Zenodo Google Dataset Search
PAN-WVC-10 Webis group 2010 439 MB 32K documents Vandalism Detection Zenodo Google Dataset Search
PAN-WVC-11 Webis group 2011 371 MB 24K documents Vandalism Detection Zenodo Google Dataset Search
PAN11-Attribution 2011 3 MB Author Identification Zenodo Google Dataset Search
PAN12-Attribution 2012 9 MB Author Identification Zenodo Google Dataset Search
PAN12-Sexual-Predator-Identification 2012 92 MB Deception Detection Zenodo Google Dataset Search
PAN12-Source-Retrieval 2012 1 MB Originality Zenodo Google Dataset Search
PAN12-Text-Alignment 2012 783 MB Originality Zenodo Google Dataset Search
PAN13-Author-Profiling 2013 713 MB Author Profiling Zenodo Google Dataset Search
PAN13-Source-Retrieval 2013 3 MB Originality Zenodo Google Dataset Search
PAN13-Text-Alignment 2013 35 MB Originality Zenodo Google Dataset Search
PAN13-Verification 2013 1 MB Author Identification Zenodo Google Dataset Search
PAN14-Author-Profiling 2014 205 MB Author Profiling Zenodo Google Dataset Search
PAN14-Source-Retrieval 2014 7 MB Originality Zenodo Google Dataset Search
PAN14-Text-Alignment 2014 22 MB Originality Zenodo Google Dataset Search
PAN14-Verification 2014 9 MB Author Identification Zenodo Google Dataset Search
PAN15-Author-Profiling 2015 2 MB Author Profiling Zenodo Google Dataset Search
PAN15-Source-Retrieval 2015 7 MB Originality Zenodo Google Dataset Search
PAN15-Verification 2015 3 MB Author Identification Zenodo Google Dataset Search
PAN16-Author-Masking PAN 2016 2 MB 205 cases Author Obfuscation Zenodo Google Dataset Search
PAN16-Author-Profiling 2016 2 MB Author Profiling Zenodo Google Dataset Search
PAN16-Clustering 2016 3 MB Author Identification Zenodo Google Dataset Search
PAN17-Author-Profiling 2017 254 MB Author Profiling Zenodo Google Dataset Search
PAN17-Clustering 2017 1 MB Author Identification Zenodo Google Dataset Search
PAN17-Style-Change-Detection 2017 8 MB Multi-Author Analysis Zenodo Google Dataset Search
PAN18-Attribution 2018 4 MB 2K cases Author Identification Zenodo Google Dataset Search
PAN18-Author-Profiling PAN 2018 7 GB 8K cases Author Profiling Zenodo Google Dataset Search
PAN18-Style-Change-Detection 2018 8 MB 3K cases Multi-Author Analysis Zenodo Google Dataset Search
PAN19-Attribution 2019 13 MB Author Identification Zenodo Google Dataset Search
PAN19-Bots-and-Gender-Profiling 2019 38 MB Author Profiling Zenodo Google Dataset Search
PAN19-Celebrity-Profiling 2019 3 GB Author Profiling Zenodo Google Dataset Search
PAN19-Style-Change-Detection 2019 10 MB Multi-Author Analysis Zenodo Google Dataset Search
PAN20-Celebrity-Profiling 2020 7 GB Author Profiling Zenodo Google Dataset Search
PAN20-Profiling-Fake-News-Spreaders-in-Twitter 2020 8 MB Author Profiling Zenodo Google Dataset Search
PAN20-Style-Change-Detection 2020 98 MB Multi-Author Analysis Zenodo Google Dataset Search
PAN20-Authorship-Verification 2020 838 MB Authorship Verification Zenodo Google Dataset Search
PAN20-Authorship-Verification (Large) 2020 4 GB Authorship Verification Zenodo Google Dataset Search
PAN21-Authorship-Verification 2021 322 MB Authorship Verification Zenodo Google Dataset Search
PAN21-Style-Change-Detection 2021 19.2 MB Multi-Author Analysis Zenodo
PAN21-Profiling-Hate-Speech-Spreaders-on-Twitter 2021 2.8 MB Author Profiling Zenodo
PAN22-Authorship-Verification 2022 23 MB Authorship Verification Zenodo Google Dataset Search
Profiling-Irony-and-Stereotype-Spreaders-on-Twitter 2022 5.7 MB Author Profiling Zenodo
Touché Corpora
Name Publisher/Creator Year Size [bytes] Size [units] Default Task Access
Touché20-Argument-Retrieval-for-Comparative-Questions Webis group 2020 3 MB 50 topics Argument search Zenodo
Touché20-Argument-Retrieval-for-Controversial-Questions Webis group 2020 9 MB 50 topics Argument search Zenodo
Touché21-Argument-Retrieval-for-Comparative-Questions Webis group 2021 200 KB 50 topics Argument search Zenodo
Touché21-Argument-Retrieval-for-Controversial-Questions Webis group 2021 1 MB 50 topics Argument search Zenodo
Touché22-Argument-Retrieval-for-Comparative-Questions Webis group 2022 700 MB 50 topics Argument search Zenodo
Touché22-Argument-Retrieval-for-Controversial-Questions Webis group 2022 2 GB 50 topics Argument search Zenodo
Touché22-Image-Retrieval-for-Arguments Webis group 2022 169 GB 50 topics Argument search Zenodo
Touché23-Human-Value-Detection Webis group 2022 1 MB 5K arguments Computational Argumentation Zenodo
Internal Webis Corpora
Name Publisher/Creator Year Size [bytes] Size [units] Default Task Access
Arxiv Webis group - 674 MB 550 documents -
Bauphysik Webis group 2010 70 MB - Vertical Search
Converter Testfiles Webis group - 2 GB - -
Genre Corpus (2008) Webis group 2008 26 MB 2K documents Web Genre Analysis
German Newsgroups Webis group - 54 MB 27K documents Cluster Analysis
Google News Crawl Webis group - 404 MB 35K documents -
Gutenberg Wordcount Webis group - 4 MB - -
Netspeak Dictionary Webis group - 3 GB - -
ODP Cluster Labeling Webis group 2010 - 6K documents Cluster Labeling
Slashdot Webis group - 3 GB - -
TLDP Crawl Webis group - 366 MB 15K documents -
Twitter Movie Sentiments Webis group 2010 1 GB - Sentiment Analysis
Webdiversity Webis group - 225 MB - -
Webis-CSP-15 Webis group 2015 90 GB 30K documents Clustering/Cluster Labeling
Wikipedia Editwars Webis group 2008 919 MB - Editwar Detection
Yandex Question Queries Webis group 2012 200 GB 2B queries -
Youtube Comments Webis group - 2 GB 324K documents -
Affiliated Corpora
Name Publisher/Creator Year Size [bytes] Size [units] Default Task Access
Burrows Authorship Corpora Steven Burrows, RMIT University 2010 8 MB - Source Code Authorship Attribution
Common Crawl Common Crawl organization 2009-2021 (+) 1.7 PB 3M WARC files Web Analysis
CompArg: Comparative Sentences 2019 Universität Hamburg 2019 3 MB - Comparative Sentences Classification Zenodo Google Dataset Search
Dagstuhl-15512-ArgQuality Dagstuhl-15512 Quality breakout group 2017 1 MB 304 arguments Computational Argumentation Zenodo
Internet Archive Internet Archive organization 350 TB 800K WARC files Web Analysis
Paderborn Genre Analysis Corpus 2012 Baumann, Lettmann, Stein 2012 20 MB - Web Genre Analysis Zenodo Google Dataset Search
Scientific Author's Writing Style Corpus 2017 Rexha, Kröll, Ziak, Kern 2017 - 66 cases Authorship Attribution Zenodo Google Dataset Search
Other Corpora
Name Publisher/Creator Year Size [bytes] Size [units] Default Task Access
20 Newsgroups Carnegie Mellon University 1999 18 MB 20K documents Text Classification, Text Clustering
7Sectors-WebKB CMU World Wide Knowledge Base 2001 6 MB 5K documents Text Classification, Text Clustering
A Corpus of Plagiarised Short Answers University of Sheffield 2009 80 KB 100 documents Plagiarism Detection
ABCD (Agreement By Create Debaters) Sara Rosenthal 2015 42 MB 10K dialogues Conversation Analysis (written, human-human)
AgreeSum New York University 2021 12 MB 18K multiple articles-summary pairs Text Summarization, Multi-document
AWTP (Agreement in Wikipedia Talk Pages) Sara Rosenthal 2012 235 KB 822 dialogues Conversation Analysis (written, human-human)
All The News Kaggle 2020 3.1 GB 2.7M news articles Text Summarization, Text Analysis
Annotated Customer Reviews Simon Fraser University Burnaby 2004 870 KB - Sentiment Analysis
Any-Aspect Summarization Carnegie Mellon University 2020 1.5 GB 280K article-summary pairs Text Summarization
AOL Query Log AOL 2006 2 GB 112M queries Query Log Analysis
Argument Annotated Essays, v1 TU Darmstadt 2014 7 MB 90 essays Computational Argumentation
Argument Annotated Essays, v2 TU Darmstadt 2016 6 MB 402 essays Computational Argumentation
Araucaria Argumentation Corpus University of Dundee 2014 9 MB 664 examples Computational Argumentation
Arguing Subjectivity Corpus University of Pittsburgh 2012 732 KB 84 documents Computational Argumentation
Arxiv-PubMed Corpus Georgetown University 2018 4.2 GB 350K article-abstract pairs Text Summarization, Scientific Document Summarization
Bergsma-Wang-Corpus 2007 S. Bergsma and Q. I. Wang 2007 2 MB 2K queries Web Search Analysis
BigPatent Summarization Corpus Khoury College of Computer Sciences 2019 6 GB 1M article-summary pairs (US patents) Text Summarization
Bill Summarization Corpus FiscalNote Research 2019 64 MB 22K article-summary pairs (US bills) Text Summarization
BLOGS06 test collection University of Glasgow 2006 - 4M documents Link Analysis
BNC Writing Errors J. Wagner et al. 2007 274 MB - Writing Error Detection
British National Corpus (XML) BNC Consortium 2007 5 GB 4K texts Text Analysis (English)
Brown Corpus Brown University 2011 22 MB 500 documents Text Analysis (English)
Change My View Modes Columbia University 2017 - 78 discussion threads Computational Argumentation
CEEAUS 2010 Beta Edition Kobe University 2010 - 2K documents Cross-Language Analysis
CLEANEVAL 2007 University of Trento and University of Leeds 2007 15 MB 1K documents Main Content Extraction
CLEF-IP 2009 Information Retrieval Facility Society (IRF) 2009 14 GB 2M documents Patent Retrieval
CLEF-IP 2010 Information Retrieval Facility Society (IRF) 2010 9 GB 3M documents Patent Retrieval
ClueWeb09 Carnegie Mellon University 2009 4 TB 1B web pages Web Mining
ClueWeb12 Carnegie Mellon University 2012 5 TB 733M web pages Web Mining
CNN-DailyMail IBM 2016 1 GB 200K article-summary pairs Text Summarization
CoNLL-2003 University of Antwerp 2003 12 MB - Named Entity Recognition
ConvoSumm Corpus Yale University 2021 650 MB 500 comments-summary pairs Text Summarization, Dialogue Summarization
CoPhIR Consiglio Nazionale delle Ricerche (ISTI-CNR) 2003 54 GB 106M images Image Retrieval
CORE The Open University 2018 330 GB 123M documents Data Mining
DBLP University of Massachusetts Amherst 2006 910 MB - Network Analysis
Dbpedia 3.5 DBpedia 2010 8 GB - Data Mining
DialogSum Corpus Zhejiang University 2021 4 MB 13K dialogue-summary pairs with topics Text Summarization, Dialogue Summarization
DMOZ Open Directory Project 2010 11 GB - Clustering, Cluster Labeling, and Data Mining
DoQA Ixa 2020 4 MB 2437 dialogues Conversation Analysis (written, human-human)
ECML PKDD Discovery Challenge 2008 ECML 2008 304 MB 17M lines Collaborative Filtering and Spam Detection
ESL 123 Mass Noun Examples Microsoft Corporation 2006 204 KB 123 sentences Cross-Language Analysis
Essay Argument Strength UT Dallas 2015 30 KB 1K scores Essay scoring
Essay Organization UT Dallas 2010 30 KB 1K scores Essay scoring
Essay Prompt Adherence UT Dallas 2014 38 KB 830 scores Essay scoring
Essay Thesis Clarity UT Dallas 2013 6 MB 830 scores Essay scoring
Finegrained Sentiment Uppsala University 2011 4 MB 294 reviews Sentiment Analysis
European Corpus Initiative Multilingual Corpus I European Corpus Initiative 1994 824 MB 49M words Text Analysis (Multilingual)
Europarl (v1 & v3) University of Edinburgh 2007 3 GB - Machine Translation
Falko Essaykorpus L2 V2 Institut für deutsche Sprache und Linguistik 2005 5 MB 248 documents Interlanguage Analysis
General Inquirer Dictionary Harvard University 1966 4 MB 182 categories Sentiment Analysis
Google Books N-Gram 20090715 Google 2009 898 GB - Data Mining
Google Web 1T 5-gram Version 1 Google 2006 55 GB 5B n-grams Text Analysis (English)
IBM Debater- Claim Sentences Search IBM 2018 600 MB 2M topic conclusion pairs Argument Search
IBM Debater- Evidence Sentences IBM 2018 3 MB 6K topic premise pairs Argument Search
IBM Debater- Claims and Evidence, EMNLP-2015 IBM 2015 8 MB 5K topic argument pairs Argument Mining
IBM Debater- Claims and Evidence, ACL-14 IBM 2014 3 MB 1K topic argument pairs Argument Mining
IBM Debater- Claim Stance Dataset IBM 2017 8 MB 2K topic conclusion Stance Classification
IBM Debater- Sentiment Lexicon of Idiomatic Expressions IBM 2018 3 MB 5K phrases Sentiment Analysis
IBM Debater- Sentiment Composition Lexicon IBM 2018 10 MB 66K words Sentiment Analysis
IBM Debater- Wikipedia Category Stance IBM 2018 1 MB 5K Wikipedia categories Stance Classification
IBM Debater- Word IBM 2018 4 MB 19K wikipedia concept pairs Semantic Relatedness
IBM Debater- TR9856 IBM 2015 2 MB 10K phrase pairs Semantic Relatedness
IBM Debater- Mention Detection Benchmark IBM 2018 2 MB 3K sentences Mention Detection
IBM Debater- Recorded Debating Dataset IBM 2018 2 MB 60 discussions Computational Argumentation
ICWSM 2009 Data Challenge ICWSM 2009 37 GB - Network Analysis
imat2009 dataset Yandex 2009 650 MB - Machine-learned Ranking
Intelligence Squared Debates (IQ2) Zhang et al. 2016 4 MB 108 dialogues Conversation Analysis (spoken, human-human)
International Corpus of Learner English v2 Center for English Corpus Linguistics 2009 92 MB 6K documents Language Analysis
Internet Argument Corpus v2 UC Santa Cruz 2016 3 GB 11K dialogues Conversation Analysis (written, human-human)
IP2Location LITE databases 2016-20 IP2Location 2016-2019 5 GB 5 years IP-geolocation and proxies
The JRC-Acquis Multilingual Parallel Corpus (3) European Commission's Office for Official Publications (OPOCE) 2009 2 GB - Cross-Language Research
Topical Chat Dataset Amazon 2019 76 MB 11K dialogues Conversation Analysis (written, human-human)
Key-value Retrieval Dataset Stanford University 2017 1 MB 3K dialogues Conversation Analysis (written, human-wizard)
Koppel Authorship Corpus M. Koppel and J. Schler 2004 4 MB - Authorship Verification
Learning To Rank 3 Microsoft 2008 8 GB - Machine-learned Ranking
Lee 50 Documents M. D. Lee et al. 2005 130 KB 50 documents Text Similarity Analysis
Maluuba Frames Maluuba (Microsoft) 2017 4 MB 1K dialogues Conversation Analysis (written, human-wizard)
MANtIS Lambda-Lab at TU Delft 2019 6 GB 80K dialogues Conversation Analysis (written, human-human)
MediaSum Corpus Microsoft Cognitive Services Research Group 2021 1.5 GB 463K interview transcript-summary pairs Text Summarization, Dialogue Summarization
MEDLINE-PubMed Corpus University of Zürich 2018 7 GB 5M article-abstract & abstract-title pairs Text Summarization, Scientific Document Summarization
METER Corpus Department of Journalism and Department of Computer Science at Sheffield University 2002 10 MB - Text Reuse
MIR Flickr 2008 LIACS Medialab at Leiden University, Netherlands 2008 3 GB 25K documents Image Retrieval
MISC Microsoft 2017 23 GB 110 dialogues Conversation Analysis (spoken, human-human)
Movielens University of Minnesota 1998-2009 74 MB 11M ratings Collaborative Filtering
Movie Review Data Cornell University 2004-2005 219 MB 12K reviews Sentiment Analysis
MPC (Multi-Party Chat) Shaikh et al. 2010 2 MB 14 dialogues Conversation Analysis (written, human-human)
MSMARCO Conversational Search Microsoft 2019 1 GB 2M synthetic search sessions Next Query Prediction
Multi Domain Sentiment Dataset (Processed ACL) Johns Hopkins University 2007 29 MB - Sentiment Analysis
Multilingual Amazon Reviews P. Keung et al. 2020 640 MB 1.3M reviews Text Classification (Multilingual)
Multi-Aspect Summarization Amazon Research 2019 946 MB 280K article-summary pairs Text Summarization
Multi-News Yale University 2019 676 MB 54K multiple articles-summary pairs Text Summarization, Multi-document
MultiWOZ 2.1 M. Eric et al. 2020 19 MB 10K dialogues Conversation Analysis (written, human-wizard)
Multi-XScience Mila 2020 61.3 MB 40K article-summary pairs Text Summarization, Scientific Document Summarization
Montclair Electronic Language Database Montclair State University 2001 56 KB 33 documents Cross-Language Analysis
Netflix Challenge (Partial) Netflix 2006 2 GB - Collaborative Filtering
Newsroom Cornell University 2018 5 GB 1.3M article-summary pairs Text Summarization
New York Times Corpus New York Times 2008 3 GB 2M articles Text Mining
NBC 2016 Russian Troll Tweets NBC 2018 34 MB 267K tweets Propaganda detection
ODP239 C. Carpineto and G. Romano 2009 5 MB - Subtopic Information Retrieval
OHSUMED Test Collection Oregon Health & Science University 1994 461 MB - Text Clustering
OpenWebText Corpus Brown University 2019 40 GB 8M documents Language Modeling, Text Synthesis
OPUS (Europarl3_0b and EMEA0) Jörg Tiedemann 2009 9 GB 22 languages Machine Translation
OR-QuAC C. Qu et al. 2020 10 GB 6K dialogues Conversation Analysis (written, human-wizard), Question Answering
QuAC E. Choi et al. 2018 75 MB 14K dialogues Conversation Analysis (written, human-wizard), Question Answering
RadioTalk Laboratory for Social Machines, MIT Media Lab 2019 9 GB 3B words Language Analysis
Reason Identification and Classification Dataset UT Dallas 2014 4 MB - Computational Argumentation
Reddit TIFU corpus Seoul National University 2019 640 MB 123K content-summary pairs Text Summarization
Reuters 21578 (22173) Reuters, David D. Lewis 1996 8 MB 22K articles Text Clustering
Reuters RCV1 Reuters, David D. Lewis 2000 1 GB 365 documents Text Clustering
Reuters RCV1 - CCAT split Reuters, David D. Lewis 2002 2 GB - Machine Learning
Reuters RCV1/RCV2 Multilingual, Multiview Text Categorization Test Collection National Research Council of Canada 2009 166 MB - Cross-Language Categorization
Request For Comments Collections (to 4501) RFC Editor 2008 55 MB 4K documents Data Mining
Rovereto Twitter N-Gram Corpus University of Trento, Italy 2011 5 GB 75M tweets Social Network Analysis
ScisummNet Corpus Yale University 2019 15 MB 1000 scientific paper-summary pairs (with citation networks) Text Summarization, Scientific Document Summarization
SILS Learner Corpus of English Waseda University 2007 16 MB - Cross-Language Analysis
SMS Spam Collection v T. A. Almeida and J. M. G. Hidalgo 2011 210 KB 6K messages Spam Identification
Spoken Conversational Search Data Set J.R. Trippas et al. 2017 260 KB 101 dialogues Conversation Analysis (written, human-human)
Spotify Podcasts Dataset Clifton et al. 2020 2 TB 50K hours Conversation Analysis (spoken, human-human)
SumPubMed Corpus University of Utah 2021 608 MB 33K scientific paper-summary pairs Text Summarization, Scientific Document Summarization
TED-LIUM Release 3 Ubiqus and LIUM 2018 50 GB 452 hours Speech Recognition
TIPSTER Complete Advanced Research Projects Agency 1993 1 MB - Information Retrieval
TREC vol4 National Institute of Standards and Technology (NIST) 1996 436 MB 295K documents Data Mining
TREC vol5 National Institute of Standards and Technology (NIST) 1997 389 MB 260K documents Data Mining
TREC web National Institute of Standards and Technology (NIST) 1999-2004 90 GB - Data Mining
TripAdvisor Data Set University of Illinois at Urbana-Champaign 2010 220 MB - Opinion Mining
Tswana Learner English Corpus Center for Text Technology 2006 2 MB - Cross-Language Analysis
Twitter tweets Yang and Leskovec 2011 26 GB 467M tweets Social Network Analysis
Twitter tweets (RecSys Challenge) Twitter 2020 76 GB 160M tweets Social Network Analysis
UKPConvArg1 TU Darmstadt 2016 21 MB 16K argument pairs Computational Argumentation
UKPConvArg2 TU Darmstadt 2016 23 MB 9K argument pairs Computational Argumentation
USPTO Patents from 2001 to 2010 U.S. Patent & Trademark Office 2010 10 TB - Patent Analysis
Uppsala Student English Uppsala University 2001 3 MB 2K documents Cross-Language Analysis
VQuAnDa Kacupaj et al. 2020 2 MB 5K question-answer-SPARQL query triplets Answer Verbalization
WaCKy: deWaC Web-As-Corpus Kool Yinitiative 2009 26 GB 2B words Text Analysis (German)
WaCKy: frWaC Web-As-Corpus Kool Yinitiative 2009 5 GB 2B words Text Analysis (French)
WaCKy: itWaC Web-As-Corpus Kool Yinitiative 2009 31 GB 2B words Text Analysis (Italian)
WaCKy: sdeWaC Web-As-Corpus Kool Yinitiative 2009 20 GB 1B words Text Analysis (German)
WaCKy: ukWaC Web-As-Corpus Kool Yinitiative 2009 15 GB 2B words Text Analysis (English)
WaCKy: WaCkypedia_EN Web-As-Corpus Kool Yinitiative 2009 6 GB 1B words Text Analysis (English)
WCEP MDS Dataset: Wikipedia Current Events Portal Aylien Ltd., Dublin, Ireland 2020 2 GB 2.39M document clusters with one human-written summary per cluster Text Summarization, Multi-document
Web People Search Corpus (WePS-1) NLP Group (UNED), Proteus Project (NYU) 2007 295 MB 2K web pages Person Disambiguation, Text Clustering
Web People Search Corpus (WePS-2) NLP Group (UNED), Proteus Project (NYU) 2009 328 MB 3K web pages Person Disambiguation, Text Clustering
Web People Search Corpus (WePS-3) NLP Group (UNED), Proteus Project (NYU) 2010 571 MB 50K web pages Person Disambiguation, Text Clustering
WikiHow Summarization Corpus University of California 2018 2 GB 230K article-summary, paragraph-summary pairs Text Summarization
Wikipedia Revision Dump Wikimedia Foundation 2006 46 GB - Data Mining
Wikipedia Revision Dump Wikimedia Foundation 2008 133 GB - Data Mining
Wikipedia Full Dump Wikimedia Foundation 2011 5 TB - Data Mining
Wikipedia History Snapshots Wikimedia Foundation 2006-2012 32 GB - Data Mining
Wikipedia Snapshots Wikimedia Foundation 2006-2012 280 GB - Data Mining
WikiSum Corpus Amazon 2021 115 MB 40K article-summary pairs Text Summarization
Wikipedia Participation Challenge Wikimedia Foundation 2011 976 MB - User Behaviour Prediction
Wordsim353 L. Finkelstein et al. 2002 60 KB 353 word pairs Word Similarities
Wortschatz Leipzig Universität Leipzig 2006 8 GB 15 languages Text Analysis (Multilingual)
XL-Sum Corpus Bangladesh University of Engineering and Technology 2021 1.3 GB 1.35M article-summary pairs Text Summarization, Multilingual Text Summarization
XSum Corpus University of Edinburgh 2018 240 MB 214K article-summary pairs Text Summarization
Yahoo Learning To Rank Challenge 2010 Yahoo 2010 421 MB - Document Ranking
Yahoo N-Grams Yahoo 2006 13 GB - Text Analysis (English)