Data
This page organizes all corpora which have resulted from or have been used in our research.
The data is made available to Webis-external researchers in various places:
(1) corpora that have been officially released by Webis as well as
(2) corpora of the PAN and
(3) Touché series can be downloaded here,
(4) internal Webis corpora (which will be officially released in the future) are supplied upon request,
(5) affiliated corpora made available by courtesy of our research partners can be downloaded here,
(6) other corpora can be downloaded from their original publisher/creator.
Most of our released corpora are hosted at Zenodo and are indexed in the Google Dataset Search; a few larger corpora are available in the Internet Archive. A dedicated symbol indicates a browsing facility for the respective corpus.

Released Webis Corpora

Name | Publisher/Creator | Year | Size [bytes] | Size [units] | Unit | Default Task
---|---|---|---|---|---|---
Arg-Microtexts Synthesis Benchmark | Webis group | 2018 | 4 MB | 260 | arguments | Computational Argumentation
args.me corpus | Webis group | 2019 | 876 MB | 388K | arguments | Computational Argumentation
ArguAna Counterargs | Webis group | 2018 | 106 MB | 7K | arguments | Computational Argumentation
ArguAna TripAdvisor | Webis group & FG Engels | 2014 | 283 MB | 2K | reviews | Sentiment Analysis
BuzzFeed-Webis Fake News Corpus 16 | Webis group | 2018 | 5 GB | 1K | articles | News analysis
CauseNet-20 | Webis group & Data Science Group | 2020 | 1.8 GB | 11.6M | relations | Causal Relation Analysis
Genre-KI-04 | Webis group | 2004 | 11 MB | 1K | documents | Web Genre Analysis
LFA-11 | Webis group & FG Engels | 2011 | 5 MB | - | | Genre and Sentiment Analysis
WDVC-15 | FG Engels & Webis group | 2015 | 5 GB | 24M | revisions | Vandalism Detection
WDVC-16 | FG Engels & Webis group | 2016 | 30 GB | 83M | revisions | Vandalism Detection
Webis-Ambient-15 | Webis group | 2015 | 114 MB | 6K | documents | Clustering/Cluster Labeling
Webis-ArgImages-21 | Webis group | 2021 | 1 MB | 3K | images | Computational Argumentation
Webis-ArgKB-20 | Webis group | 2020 | 1 MB | 5K | argumentative relations | Computational Argumentation
Webis-ArgQuality-20 | Webis group | 2020 | 3 MB | 1K | arguments | Computational Argumentation
Webis-ArgRank-17 | Webis group | 2017 | 13 MB | 18K | arguments | Computational Argumentation
Webis-Argument-Attributes | Webis group & DRL Potsdam | 2020 | 1 KB | 20 | attributes | Computational Argumentation
Webis-Argument-Framing-19 | Webis group | 2019 | 7 MB | 12K | arguments | Computational Argumentation and Framing
Webis-ArgValues-22 | Webis group | 2022 | 1 MB | 5K | arguments | Computational Argumentation
Webis-Bias-Flipper-18 | Webis group | 2018 | 13 MB | 6K | documents | Natural Language Generation
Webis-CausalQA-22 | Webis group | 2022 | 440 MB | 1.1M | question-answer pairs | Causal Question Answering
Webis-Clickbait-16 | Webis group | 2016 | 255 MB | 3K | tweets | Clickbait Detection
Webis-Clickbait-17 | Webis group | 2017 | - | 20K | tweets | Clickbait Detection
Webis-Clickbait-22 | Webis group | 2022 | 10 MB | 5K | posts | Clickbait Spoiling
Webis-CLS-10 | Webis group | 2010 | 530 MB | 800K | documents | Cross-Language Text Classification
Webis-CMV-20 | Webis group | 2020 | 3 GB | - | argument pairs | Computational Argumentation
Webis-CompQuestions-20 | Webis group | 2020 | 1 MB | 15K | questions | Comparative Question Classification
Webis-CompQuestions-22 | Webis group | 2022 | 5 MB | 31K | questions | Comparative Question Classification
Webis-ConcluGen-21 | Webis group | 2021 | 225 MB | 136K | argument-conclusion pairs | Informative Conclusion Generation, Text Summarization
Webis-Conversational-Query-Reformulations-21 | Webis group | 2021 | 193 KB | 3K | messages | Query classification
Webis Chatnoir-Copycat 2021 | Webis group | 2021 | 90.6 TB | 6.7B | documents | Duplicate Detection
Webis-CPC-11 | Webis group | 2011 | 19 MB | 8K | paraphrases | Plagiarism Detection
Webis-Debate-16 | Webis group | 2016 | 908 KB | 27K | text segments | Computational Argumentation
Webis-Editorial-Quality-18 | Webis group | 2018 | 3 MB | 1K | documents | Computational Argumentation
Webis-Editorials-16 | Webis group | 2016 | 5 MB | 300 | documents | Computational Argumentation
Webis-EditorialSum-20 | Webis group | 2020 | 10 MB | 1330 | editorials | Text Summarization
Webis-Exhibition-Questions-21 | Webis group | 2021 | 34 MB | 849 | questions | Conversational Analysis (written)
Webis-Gmane-19 | Webis group | 2019 | 160 GB | 153M | emails | Dialog Analysis
Webis-KIQC-13 | Webis group | 2013 | 1 MB | 3K | questions | Known-Item Search
Webis-Mnemonics-17 | Webis group | 2017 | 2 MB | 1K | mnemonics | Password analysis
Webis MS MARCO Anchor Text 2022 | Webis group | 2022 | 3.5 GB | 6.5M | documents | Anchor Text
Webis-NIL-21 | Webis group | 2021 | 392 KB | 37K | log entries | Query identification
Webis-ODP-10 | Webis group | 2010 | 113 MB | 5M | documents | Clustering/Cluster Labeling
Webis-PC-08 | Webis group | 2008 | 298 MB | - | | Plagiarism Detection
Webis-PRA-12 | Webis group | 2012 | 884 KB | 14K | company names | Spelling Error Detection
Webis-QInC-22 | Webis group | 2022 | 79 MB | 13 MB | queries | Query Interpretation
Webis-QSeC-10 | Webis group | 2010 | 2 MB | - | | Query Segmentation
Webis-QSpell-17 | Webis group | 2017 | 1 MB | - | | Query Spelling Correction
Webis-QTM-19 | Webis group | 2019 | 2 MB | 200K | queries | Query-task mapping
Webis-Revenue-10 | FG Engels & Webis group | 2010 | 6 MB | 1K | documents | Entity and Relation Extraction
Webis-SameSentiment-21 | Webis group | 2021 | 43 MB | 704K | sentiment pair ids | Sentiment Analysis
Webis-SameSide-19 | Webis group | 2020 | 63 MB | 125K | argument pairs | Computational Argumentation
Webis-SameSide-21 | Webis group | 2021 | 150 MB | - | argument pairs | Computational Argumentation
Webis-SameSideAdversarial-21 | Webis group | 2021 | 50 KB | 175 | argument pairs | Computational Argumentation
Webis-SCSmeta-21 | Webis group | 2021 | 25 KB | 1K | turns | Conversational Analysis (spoken)
Webis-SDMbridge-12 | Webis group | 2012 | 58 MB | 15K | models | Simulation Data Mining
Webis-Sentences-17 | Webis group | 2017 | 200 GB | 3B | sentences | Text statistics
Webis-SMC-12 | Webis group | 2012 | 123 KB | - | | Search Mission Detection
Webis-Snippet-20 | Webis group | 2020 | 11 GB | 10M | snippet-webpage pairs | Abstractive Snippet Generation, Text Summarization
Webis-TLDR-17 | Webis group | 2017 | 2 GB | 4M | content-summary pairs | Text Summarization
Webis-TRC-12 | Webis group | 2012 | 120 MB | 150 | interaction logs | Text Reuse Detection, Paraphrasing, and Exploratory Search
Webis-Tripad-13-Sentiment | Webis group | 2013 | 3 MB | 2K | reviews | Sentiment Analysis
Webis-Tripad-14 | Webis group | 2014 | 61 MB | 266K | reviews | Sentiment Analysis and Author Profiling
Webis-Voice-based-and-Conversational-Argument-Search-20 | Webis group | 2020 | 350 KB | 500 | participants | Conversational Analysis (spoken)
Webis-Web-Archive-17 | Webis group | 2017 | 94 GB | 10K | documents | Web Analysis
Webis-Web-Archive-Quality-22 | Webis group | 2012 | 18 GB | 7K | documents | Web Analysis
Webis-Web-Errors-19 | Webis group | 2019 | 1 MB | 10K | documents | Web Analysis
Webis-WebSeg-20 | Webis group | 2020 | 12 GB | 8K | documents | Web Page Segmentation
Webis-WebSeg-20-Algorithm-Segmentations | Webis group | 2021 | 7 GB | 246K | segmentations | Web Page Segmentation
Webis-WikiDebate-18 | Webis group | 2018 | 78 MB | 6M | discussions | Computational Argumentation
Webis-WikiDiscussions-18 | Webis group | 2018 | 4 GB | 6M | discussions | Computational Argumentation
Webis-Wikipedia-Text-Reuse-18 | Webis group | 2018 | - | - | text segments | Text Reuse Analysis
Webis-WVC-07 | Webis group | 2007 | 12 KB | 1K | documents | Vandalism Detection
Webis-YouTube8MA-18 | Webis group | 2018 | 169 GB | 6M | documents | Video Retrieval

PAN Corpora

Name | Publisher/Creator | Year | Size [bytes] | Size [units] | Unit | Default Task
---|---|---|---|---|---|---
Alvi15-Text-Alignment-en-fa | | 2015 | 2 MB | 200 | documents | Originality
C10-Attribution | | 2015 | 4 MB | | | Author Identification
C50-Attribution | | 2015 | 17 MB | | | Author Identification
Cheema15-Text-Alignment-en | | 2015 | 4 MB | | | Originality
Hanfi15-Text-Alignment-en-ur | | 2015 | 3 MB | | | Originality
Khoshnavataher15-Text-Alignment-fa | | 2015 | 16 MB | | | Originality
Kong15-Text-Alignment-zh | | 2015 | 3 MB | | | Originality
Mohtaij15-Text-Alignment-en | | 2015 | 57 MB | | | Originality
Palkovskii15-Text-Alignment-en | | 2015 | 26 MB | | | Originality
PAN-PC-09 | Webis group | 2009 | 2 GB | 41K | documents | Plagiarism Detection
PAN-PC-10 | Webis group | 2010 | 2 GB | 27K | documents | Plagiarism Detection
PAN-PC-11 | Webis group | 2011 | 2 GB | 27K | documents | Plagiarism Detection
PAN-SemEval-Hyperpartisan-News-Detection-19 | Webis & Factmata | 2018 | 1 GB | 751K | articles | Hyperpartisan News Detection
PAN-WQF-12 | Webis group | 2012 | 4 GB | 2M | documents | Quality Flaw Prediction
PAN-WVC-10 | Webis group | 2010 | 439 MB | 32K | documents | Vandalism Detection
PAN-WVC-11 | Webis group | 2011 | 371 MB | 24K | documents | Vandalism Detection
PAN11-Attribution | | 2011 | 3 MB | | | Author Identification
PAN12-Attribution | | 2012 | 9 MB | | | Author Identification
PAN12-Sexual-Predator-Identification | | 2012 | 92 MB | | | Deception Detection
PAN12-Source-Retrieval | | 2012 | 1 MB | | | Originality
PAN12-Text-Alignment | | 2012 | 783 MB | | | Originality
PAN13-Author-Profiling | | 2013 | 713 MB | | | Author Profiling
PAN13-Source-Retrieval | | 2013 | 3 MB | | | Originality
PAN13-Text-Alignment | | 2013 | 35 MB | | | Originality
PAN13-Verification | | 2013 | 1 MB | | | Author Identification
PAN14-Author-Profiling | | 2014 | 205 MB | | | Author Profiling
PAN14-Source-Retrieval | | 2014 | 7 MB | | | Originality
PAN14-Text-Alignment | | 2014 | 22 MB | | | Originality
PAN14-Verification | | 2014 | 9 MB | | | Author Identification
PAN15-Author-Profiling | | 2015 | 2 MB | | | Author Profiling
PAN15-Source-Retrieval | | 2015 | 7 MB | | | Originality
PAN15-Verification | | 2015 | 3 MB | | | Author Identification
PAN16-Author-Masking | PAN | 2016 | 2 MB | 205 | cases | Author Obfuscation
PAN16-Author-Profiling | | 2016 | 2 MB | | | Author Profiling
PAN16-Clustering | | 2016 | 3 MB | | | Author Identification
PAN17-Author-Profiling | | 2017 | 254 MB | | | Author Profiling
PAN17-Clustering | | 2017 | 1 MB | | | Author Identification
PAN17-Style-Change-Detection | | 2017 | 8 MB | | | Multi-Author Analysis
PAN18-Attribution | | 2018 | 4 MB | 2K | cases | Author Identification
PAN18-Author-Profiling | PAN | 2018 | 7 GB | 8K | cases | Author Profiling
PAN18-Style-Change-Detection | | 2018 | 8 MB | 3K | cases | Multi-Author Analysis
PAN19-Attribution | | 2019 | 13 MB | | | Author Identification
PAN19-Bots-and-Gender-Profiling | | 2019 | 38 MB | | | Author Profiling
PAN19-Celebrity-Profiling | | 2019 | 3 GB | | | Author Profiling
PAN19-Style-Change-Detection | | 2019 | 10 MB | | | Multi-Author Analysis
PAN20-Celebrity-Profiling | | 2020 | 7 GB | | | Author Profiling
PAN20-Profiling-Fake-News-Spreaders-in-Twitter | | 2020 | 8 MB | | | Author Profiling
PAN20-Style-Change-Detection | | 2020 | 98 MB | | | Multi-Author Analysis
PAN20-Authorship-Verification | | 2020 | 838 MB | | | Authorship Verification
PAN20-Authorship-Verification (Large) | | 2020 | 4 GB | | | Authorship Verification
PAN21-Authorship-Verification | | 2021 | 322 MB | | | Authorship Verification
PAN21-Style-Change-Detection | | 2021 | 19.2 MB | | | Multi-Author Analysis
PAN21-Profiling-Hate-Speech-Spreaders-on-Twitter | | 2021 | 2.8 MB | | | Author Profiling
PAN22-Authorship-Verification | | 2022 | 23 MB | | | Authorship Verification
Profiling-Irony-and-Stereotype-Spreaders-on-Twitter | | 2022 | 5.7 MB | | | Author Profiling

Touché Corpora

Name | Publisher/Creator | Year | Size [bytes] | Size [units] | Unit | Default Task
---|---|---|---|---|---|---
Touché20-Argument-Retrieval-for-Comparative-Questions | Webis group | 2020 | 3 MB | 50 | topics | Argument search
Touché20-Argument-Retrieval-for-Controversial-Questions | Webis group | 2020 | 9 MB | 50 | topics | Argument search
Touché21-Argument-Retrieval-for-Comparative-Questions | Webis group | 2021 | 200 KB | 50 | topics | Argument search
Touché21-Argument-Retrieval-for-Controversial-Questions | Webis group | 2021 | 1 MB | 50 | topics | Argument search
Touché22-Argument-Retrieval-for-Comparative-Questions | Webis group | 2022 | 700 MB | 50 | topics | Argument search
Touché22-Argument-Retrieval-for-Controversial-Questions | Webis group | 2022 | 2 GB | 50 | topics | Argument search
Touché22-Image-Retrieval-for-Arguments | Webis group | 2022 | 169 GB | 50 | topics | Argument search
Touché23-Human-Value-Detection | Webis group | 2022 | 1 MB | 5K | arguments | Computational Argumentation

Affiliated Corpora

Name | Publisher/Creator | Year | Size [bytes] | Size [units] | Unit | Default Task
---|---|---|---|---|---|---
Burrows Authorship Corpora | Steven Burrows, RMIT University | 2010 | 8 MB | - | | Source Code Authorship Attribution
Common Crawl | Common Crawl organization | 2009-2021 (+) | 1.7 PB | 3M | WARC files | Web Analysis
CompArg: Comparative Sentences 2019 | Universität Hamburg | 2019 | 3 MB | - | | Comparative Sentences Classification
Dagstuhl-15512-ArgQuality | Dagstuhl-15512 Quality breakout group | 2017 | 1 MB | 304 | arguments | Computational Argumentation
Internet Archive | Internet Archive organization | | 350 TB | 800K | WARC files | Web Analysis
Paderborn Genre Analysis Corpus 2012 | Baumann, Lettmann, Stein | 2012 | 20 MB | - | | Web Genre Analysis
Scientific Author's Writing Style Corpus 2017 | Rexha, Kröll, Ziak, Kern | 2017 | - | 66 | cases | Authorship Attribution

Other Corpora

Name | Publisher/Creator | Year | Size [bytes] | Size [units] | Unit | Default Task
---|---|---|---|---|---|---
20 Newsgroups | Carnegie Mellon University | 1999 | 18 MB | 20K | documents | Text Classification, Text Clustering
7Sectors-WebKB | CMU World Wide Knowledge Base | 2001 | 6 MB | 5K | documents | Text Classification, Text Clustering
A Corpus of Plagiarised Short Answers | University of Sheffield | 2009 | 80 KB | 100 | documents | Plagiarism Detection
ABCD (Agreement By Create Debaters) | Sara Rosenthal | 2015 | 42 MB | 10K | dialogues | Conversation Analysis (written, human-human)
AgreeSum | New York University | 2021 | 12 MB | 18K | multiple articles-summary pairs | Text Summarization, Multi-document
AWTP (Agreement in Wikipedia Talk Pages) | Sara Rosenthal | 2012 | 235 KB | 822 | dialogues | Conversation Analysis (written, human-human)
All The News | Kaggle | 2020 | 3.1 GB | 2.7M | news articles | Text Summarization, Text Analysis
Annotated Customer Reviews | Simon Fraser University Burnaby | 2004 | 870 KB | - | | Sentiment Analysis
Any-Aspect Summarization | Carnegie Mellon University | 2020 | 1.5 GB | 280K | article-summary pairs | Text Summarization
AOL Query Log | AOL | 2006 | 2 GB | 112M | queries | Query Log Analysis
Argument Annotated Essays, v1 | TU Darmstadt | 2014 | 7 MB | 90 | essays | Computational Argumentation
Argument Annotated Essays, v2 | TU Darmstadt | 2016 | 6 MB | 402 | essays | Computational Argumentation
Araucaria Argumentation Corpus | University of Dundee | 2014 | 9 MB | 664 | examples | Computational Argumentation
Arguing Subjectivity Corpus | University of Pittsburgh | 2012 | 732 KB | 84 | documents | Computational Argumentation
Arxiv-PubMed Corpus | Georgetown University | 2018 | 4.2 GB | 350K | article-abstract pairs | Text Summarization, Scientific Document Summarization
Bergsma-Wang-Corpus 2007 | S. Bergsma and Q. I. Wang | 2007 | 2 MB | 2K | queries | Web Search Analysis
BigPatent Summarization Corpus | Khoury College of Computer Sciences | 2019 | 6 GB | 1M | article-summary pairs (US patents) | Text Summarization
Bill Summarization Corpus | FiscalNote Research | 2019 | 64 MB | 22K | article-summary pairs (US bills) | Text Summarization
BLOGS06 test collection | University of Glasgow | 2006 | - | 4M | documents | Link Analysis
BNC Writing Errors | J. Wagner et al. | 2007 | 274 MB | - | | Writing Error Detection
British National Corpus (XML) | BNC Consortium | 2007 | 5 GB | 4K | texts | Text Analysis (English)
Brown Corpus | Brown University | 2011 | 22 MB | 500 | documents | Text Analysis (English)
Change My View Modes | Columbia University | 2017 | - | 78 | discussion threads | Computational Argumentation
CEEAUS 2010 Beta Edition | Kobe University | 2010 | - | 2K | documents | Cross-Language Analysis
CLEANEVAL 2007 | University of Trento and University of Leeds | 2007 | 15 MB | 1K | documents | Main Content Extraction
CLEF-IP 2009 | Information Retrieval Facility Society (IRF) | 2009 | 14 GB | 2M | documents | Patent Retrieval
CLEF-IP 2010 | Information Retrieval Facility Society (IRF) | 2010 | 9 GB | 3M | documents | Patent Retrieval
ClueWeb09 | Carnegie Mellon University | 2009 | 4 TB | 1B | web pages | Web Mining
ClueWeb12 | Carnegie Mellon University | 2012 | 5 TB | 733M | web pages | Web Mining
CNN-DailyMail | IBM | 2016 | 1 GB | 200K | article-summary pairs | Text Summarization
CoNLL-2003 | University of Antwerp | 2003 | 12 MB | - | | Named Entity Recognition
ConvoSumm Corpus | Yale University | 2021 | 650 MB | 500 | comments-summary pairs | Text Summarization, Dialogue Summarization
CoPhIR | Consiglio Nazionale delle Ricerche (ISTI-CNR) | 2003 | 54 GB | 106M | images | Image Retrieval
CORE | The Open University | 2018 | 330 GB | 123M | documents | Data Mining
DBLP | University of Massachusetts Amherst | 2006 | 910 MB | - | | Network Analysis
DBpedia 3.5 | DBpedia | 2010 | 8 GB | - | | Data Mining
DialogSum Corpus | Zhejiang University | 2021 | 4 MB | 13K | dialogue-summary pairs with topics | Text Summarization, Dialogue Summarization
DMOZ | Open Directory Project | 2010 | 11 GB | - | | Clustering, Cluster Labeling, and Data Mining
DoQA | Ixa | 2020 | 4 MB | 2437 | dialogues | Conversation Analysis (written, human-human)
ECML PKDD Discovery Challenge 2008 | ECML | 2008 | 304 MB | 17M | lines | Collaborative Filtering and Spam Detection
ESL 123 Mass Noun Examples | Microsoft Corporation | 2006 | 204 KB | 123 | sentences | Cross-Language Analysis
Essay Argument Strength | UT Dallas | 2015 | 30 KB | 1K | scores | Essay scoring
Essay Organization | UT Dallas | 2010 | 30 KB | 1K | scores | Essay scoring
Essay Prompt Adherence | UT Dallas | 2014 | 38 KB | 830 | scores | Essay scoring
Essay Thesis Clarity | UT Dallas | 2013 | 6 MB | 830 | scores | Essay scoring
Finegrained Sentiment | Uppsala University | 2011 | 4 MB | 294 | reviews | Sentiment Analysis
European Corpus Initiative Multilingual Corpus I | European Corpus Initiative | 1994 | 824 MB | 49M | words | Text Analysis (Multilingual)
Europarl (v1 & v3) | University of Edinburgh | 2007 | 3 GB | - | | Machine Translation
Falko Essaykorpus L2 V2 | Institut für deutsche Sprache und Linguistik | 2005 | 5 MB | 248 | documents | Interlanguage Analysis
General Inquirer Dictionary | Harvard University | 1966 | 4 MB | 182 | categories | Sentiment Analysis
Google Books N-Gram 20090715 | | 2009 | 898 GB | - | | Data Mining
Google Web 1T 5-gram Version 1 | | 2006 | 55 GB | 5B | n-grams | Text Analysis (English)
IBM Debater - Claim Sentences Search | IBM | 2018 | 600 MB | 2M | topic conclusion pairs | Argument Search
IBM Debater - Evidence Sentences | IBM | 2018 | 3 MB | 6K | topic premise pairs | Argument Search
IBM Debater - Claims and Evidence, EMNLP-2015 | IBM | 2015 | 8 MB | 5K | topic argument pairs | Argument Mining
IBM Debater - Claims and Evidence, ACL-14 | IBM | 2014 | 3 MB | 1K | topic argument pairs | Argument Mining
IBM Debater - Claim Stance Dataset | IBM | 2017 | 8 MB | 2K | topic conclusion | Stance Classification
IBM Debater - Sentiment Lexicon of Idiomatic Expressions | IBM | 2018 | 3 MB | 5K | phrases | Sentiment Analysis
IBM Debater - Sentiment Composition Lexicon | IBM | 2018 | 10 MB | 66K | words | Sentiment Analysis
IBM Debater - Wikipedia Category Stance | IBM | 2018 | 1 MB | 5K | wikipedia category | Stance Classification
IBM Debater - Word | IBM | 2018 | 4 MB | 19K | wikipedia concept pairs | Semantic Relatedness
IBM Debater - TR9856 | IBM | 2015 | 2 MB | 10K | phrase pairs | Semantic Relatedness
IBM Debater - Mention Detection Benchmark | IBM | 2018 | 2 MB | 3K | sentences | Mention Detection
IBM Debater - Recorded Debating Dataset | IBM | 2018 | 2 MB | 60 | discussions | Computational Argumentation
ICWSM 2009 Data Challenge | ICWSM | 2009 | 37 GB | - | | Network Analysis
imat2009 dataset | Yandex | 2009 | 650 MB | - | | Machine-learned Ranking
Intelligence Squared Debates (IQ2) | Zhang et al. | 2016 | 4 MB | 108 | dialogues | Conversation Analysis (spoken, human-human)
International Corpus of Learner English v2 | Center for English Corpus Linguistics | 2009 | 92 MB | 6K | documents | Language Analysis
Internet Argument Corpus v2 | UC Santa Cruz | 2016 | 3 GB | 11K | dialogues | Conversation Analysis (written, human-human)
IP2Location LITE databases 2016-20 | IP2Location | 2016-2019 | 5 GB | 5 | years | IP-geolocation and proxies
The JRC-Acquis Multilingual Parallel Corpus (3) | European Commission's Office for Official Publications (OPOCE) | 2009 | 2 GB | - | | Cross-Language Research
Topical Chat Dataset | Amazon | 2019 | 76 MB | 11K | dialogues | Conversation Analysis (written, human-human)
Key-value Retrieval Dataset | Stanford University | 2017 | 1 MB | 3K | dialogues | Conversation Analysis (written, human-wizard)
Koppel Authorship Corpus | M. Koppel and J. Schler | 2004 | 4 MB | - | | Authorship Verification
Learning To Rank 3 | Microsoft | 2008 | 8 GB | - | | Machine-learned Ranking
Lee 50 Documents | M. D. Lee et al. | 2005 | 130 KB | 50 | documents | Text Similarity Analysis
Maluuba Frames | Maluuba (Microsoft) | 2017 | 4 MB | 1K | dialogues | Conversation Analysis (written, human-wizard)
MANtIS | Lambda-Lab at TU Delft | 2019 | 6 GB | 80K | dialogues | Conversation Analysis (written, human-human)
MediaSum Corpus | Microsoft Cognitive Services Research Group | 2021 | 1.5 GB | 463K | interview transcript-summary pairs | Text Summarization, Dialogue Summarization
MEDLINE-PubMed Corpus | University of Zürich | 2018 | 7 GB | 5M | article-abstract & abstract-title pairs | Text Summarization, Scientific Document Summarization
METER Corpus | Department of Journalism and Department of Computer Science at Sheffield University | 2002 | 10 MB | - | | Text Reuse
MIR Flickr 2008 | LIACS Medialab at Leiden University, Netherlands | 2008 | 3 GB | 25K | documents | Image Retrieval
MISC | Microsoft | 2017 | 23 GB | 110 | dialogues | Conversation Analysis (spoken, human-human)
Movielens | University of Minnesota | 1998-2009 | 74 MB | 11M | ratings | Collaborative Filtering
Movie Review Data | Cornell University | 2004-2005 | 219 MB | 12K | reviews | Sentiment Analysis
MPC (Multi-Party Chat) | Shaikh et al. | 2010 | 2 MB | 14 | dialogues | Conversation Analysis (written, human-human)
MSMARCO Conversational Search | Microsoft | 2019 | 1 GB | 2M | synthetic search sessions | Next Query Prediction
Multi Domain Sentiment Dataset (Processed ACL) | Johns Hopkins University | 2007 | 29 MB | - | | Sentiment Analysis
Multilingual Amazon Reviews | P. Keung et al. | 2020 | 640 MB | 1.3M | reviews | Text Classification (Multilingual)
Multi-Aspect Summarization | Amazon Research | 2019 | 946 MB | 280K | article-summary pairs | Text Summarization
Multi-News | Yale University | 2019 | 676 MB | 54K | multiple articles-summary pairs | Text Summarization, Multi-document
MultiWOZ 2.1 | M. Eric et al. | 2020 | 19 MB | 10K | dialogues | Conversation Analysis (written, human-wizard)
Multi-XScience | Mila | 2020 | 61.3 MB | 40K | article-summary pairs | Text Summarization, Scientific Document Summarization
Montclair Electronic Language Database | Montclair State University | 2001 | 56 KB | 33 | documents | Cross-Language Analysis
Netflix Challenge (Partial) | Netflix | 2006 | 2 GB | - | | Collaborative Filtering
Newsroom | Cornell University | 2018 | 5 GB | 1.3M | article-summary pairs | Text Summarization
New York Times Corpus | New York Times | 2008 | 3 GB | 2M | articles | Text Mining
NBC 2016 Russian Troll Tweets | NBC | 2018 | 34 MB | 267K | tweets | Propaganda detection
ODP239 | C. Carpineto and G. Romano | 2009 | 5 MB | - | | Subtopic Information Retrieval
OHSUMED Test Collection | Oregon Health & Science University | 1994 | 461 MB | - | | Text Clustering
OpenWebText Corpus | Brown University | 2019 | 40 GB | 8M | documents | Language Modeling, Text Synthesis
OPUS (Europarl3_0b and EMEA0) | Jörg Tiedemann | 2009 | 9 GB | 22 | languages | Machine Translation
OR-QuAC | C. Qu et al. | 2020 | 10 GB | 6K | dialogues | Conversation Analysis (written, human-wizard), Question Answering
QuAC | E. Choi et al. | 2018 | 75 MB | 14K | dialogues | Conversation Analysis (written, human-wizard), Question Answering
RadioTalk | Laboratory for Social Machines, MIT Media Lab | 2019 | 9 GB | 3B | words | Language Analysis
Reason Identification and Classification Dataset | UT Dallas | 2014 | 4 MB | - | | Computational Argumentation
Reddit TIFU corpus | Seoul National University | 2019 | 640 MB | 123K | content-summary pairs | Text Summarization
Reuters 21578 (22173) | Reuters, David D. Lewis | 1996 | 8 MB | 22K | articles | Text Clustering
Reuters RCV1 | Reuters, David D. Lewis | 2000 | 1 GB | 365 | documents | Text Clustering
Reuters RCV1 - CCAT split | Reuters, David D. Lewis | 2002 | 2 GB | - | | Machine Learning
Reuters RCV1/RCV2 Multilingual, Multiview Text Categorization Test Collection | National Research Council of Canada | 2009 | 166 MB | - | | Cross-Language Categorization
Request For Comments Collections (to 4501) | RFC Editor | 2008 | 55 MB | 4K | documents | Data Mining
Rovereto Twitter N-Gram Corpus | University of Trento, Italy | 2011 | 5 GB | 75M | tweets | Social Network Analysis
ScisummNet Corpus | Yale University | 2019 | 15 MB | 1000 | scientific paper-summary pairs (with citation networks) | Text Summarization, Scientific Document Summarization
SILS Learner Corpus of English | Waseda University | 2007 | 16 MB | - | | Cross-Language Analysis
SMS Spam Collection v | T. A. Almeida and J. M. G. Hidalgo | 2011 | 210 KB | 6K | messages | Spam Identification
Spoken Conversational Search Data Set | J.R. Trippas et al. | 2017 | 260 KB | 101 | dialogues | Conversation Analysis (written, human-human)
Spotify Podcasts Dataset | Clifton et al. | 2020 | 2 TB | 50K | hours | Conversation Analysis (spoken, human-human)
SumPubMed Corpus | University of Utah | 2021 | 608 MB | 33K | scientific paper-summary pairs | Text Summarization, Scientific Document Summarization
TED-LIUM Release 3 | Ubiqus and LIUM | 2018 | 50 GB | 452 | hours | Speech Recognition
TIPSTER Complete | Advanced Research Projects Agency | 1993 | 1 MB | - | | Information Retrieval
TREC vol4 | National Institute of Standards and Technology (NIST) | 1996 | 436 MB | 295K | documents | Data Mining
TREC vol5 | National Institute of Standards and Technology (NIST) | 1997 | 389 MB | 260K | documents | Data Mining
TREC web | National Institute of Standards and Technology (NIST) | 1999-2004 | 90 GB | - | | Data Mining
TripAdvisor Data Set | University of Illinois at Urbana-Champaign | 2010 | 220 MB | - | | Opinion Mining
Tswana Learner English Corpus | Center for Text Technology | 2006 | 2 MB | - | | Cross-Language Analysis
Twitter tweets | Yang and Leskovec | 2011 | 26 GB | 467M | tweets | Social Network Analysis
Twitter tweets (RecSys Challenge) | | 2020 | 76 GB | 160M | tweets | Social Network Analysis
UKPConvArg1 | TU Darmstadt | 2016 | 21 MB | 16K | argument pairs | Computational Argumentation
UKPConvArg2 | TU Darmstadt | 2016 | 23 MB | 9K | argument pairs | Computational Argumentation
USPTO Patents from 2001 to 2010 | U.S. Patent & Trademark Office | 2010 | 10 TB | - | | Patent Analysis
Uppsala Student English | Uppsala University | 2001 | 3 MB | 2K | documents | Cross-Language Analysis
VQuAnDa | Kacupaj et al. | 2020 | 2 MB | 5K | question-answer-SPARQL query triplets | Answer Verbalization
WaCKy: deWaC | Web-As-Corpus Kool Yinitiative | 2009 | 26 GB | 2B | words | Text Analysis (German)
WaCKy: frWaC | Web-As-Corpus Kool Yinitiative | 2009 | 5 GB | 2B | words | Text Analysis (French)
WaCKy: itWaC | Web-As-Corpus Kool Yinitiative | 2009 | 31 GB | 2B | words | Text Analysis (Italian)
WaCKy: sdeWaC | Web-As-Corpus Kool Yinitiative | 2009 | 20 GB | 1B | words | Text Analysis (German)
WaCKy: ukWaC | Web-As-Corpus Kool Yinitiative | 2009 | 15 GB | 2B | words | Text Analysis (English)
WaCKy: WaCkypedia_EN | Web-As-Corpus Kool Yinitiative | 2009 | 6 GB | 1B | words | Text Analysis (English)
WCEP MDS Dataset: Wikipedia Current Events Portal | Aylien Ltd., Dublin, Ireland | 2020 | 2 GB | 2.39M | document clusters with one human-written summary per cluster | Text Summarization, Multi-document
Web People Search Corpus (WePS-1) | NLP Group (UNED), Proteus Project (NYU) | 2007 | 295 MB | 2K | web pages | Person Disambiguation, Text Clustering
Web People Search Corpus (WePS-2) | NLP Group (UNED), Proteus Project (NYU) | 2009 | 328 MB | 3K | web pages | Person Disambiguation, Text Clustering
Web People Search Corpus (WePS-3) | NLP Group (UNED), Proteus Project (NYU) | 2010 | 571 MB | 50K | web pages | Person Disambiguation, Text Clustering
WikiHow Summarization Corpus | University of California | 2018 | 2 GB | 230K | article-summary, paragraph-summary pairs | Text Summarization
Wikipedia Revision Dump | Wikimedia Foundation | 2006 | 46 GB | - | | Data Mining
Wikipedia Revision Dump | Wikimedia Foundation | 2008 | 133 GB | - | | Data Mining
Wikipedia Full Dump | Wikimedia Foundation | 2011 | 5 TB | - | | Data Mining
Wikipedia History Snapshots | Wikimedia Foundation | 2006-2012 | 32 GB | - | | Data Mining
Wikipedia Snapshots | Wikimedia Foundation | 2006-2012 | 280 GB | - | | Data Mining
WikiSum Corpus | Amazon | 2021 | 115 MB | 40K | article-summary pairs | Text Summarization
Wikipedia Participation Challenge | Wikimedia Foundation | 2011 | 976 MB | - | | User Behaviour Prediction
Wordsim353 | L. Finkelstein et al. | 2002 | 60 KB | 353 | word pairs | Word Similarities
Wortschatz Leipzig | Universität Leipzig | 2006 | 8 GB | 15 | languages | Text Analysis (Multilingual)
XL-Sum Corpus | Bangladesh University of Engineering and Technology | 2021 | 1.3 GB | 1.35M | article-summary pairs | Text Summarization, Multilingual Text Summarization
XSum Corpus | University of Edinburgh | 2018 | 240 MB | 214K | article-summary pairs | Text Summarization
Yahoo Learning To Rank Challenge 2010 | Yahoo | 2010 | 421 MB | - | | Document Ranking
Yahoo N-Grams | Yahoo | 2006 | 13 GB | - | | Text Analysis (English)