Synopsis
The Webis Text Reuse Corpus 2012 (Webis-TRC-12) compiles manually written documents obtained from a completely controlled, yet representative environment that emulates the web. Each document in the corpus is about one of the 150 topics used at the TREC Web Tracks 2009–2011, thus forming a strong connection with existing evaluation efforts. Writers, hired via the crowdsourcing platform oDesk, had to retrieve sources for a given topic and reuse text from what they found. The corpus includes detailed interaction logs that consistently cover both the search for sources and the creation of the documents. These logs allow for in-depth analyses of how text is composed when a writer is at liberty to reuse text from third parties.
Interactively explore the essay writing data in the Webis-TRC-12 Essay Viewer.
Download
You can access the Webis-TRC-12 corpus on Zenodo.
If you use the dataset in your research, please send us a copy of your publication. We kindly ask you to cite the corpus via [bib]. If you additionally want to link to the dataset, please use the dataset's [doi] for a stable link.
Note that the archive file is highly compressed because the corpus contains many near-redundant essay revisions; uncompressed, the corpus takes up 18 GiB of space.
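Because the extracted corpus is far larger than the download, it may be worth checking free disk space before unpacking. The following is a minimal sketch in Python, not part of the corpus tooling: the function name `has_room_for_corpus` and the constant `REQUIRED_BYTES` are hypothetical, and the only figure taken from this page is the 18 GiB uncompressed size.

```python
import shutil

# Uncompressed corpus size as stated on this page: 18 GiB.
REQUIRED_BYTES = 18 * 1024**3


def has_room_for_corpus(target_dir: str, required_bytes: int = REQUIRED_BYTES) -> bool:
    """Return True if the file system holding target_dir has enough free space.

    This only compares free space against the uncompressed size; it does not
    account for temporary files an extraction tool might create.
    """
    free_bytes = shutil.disk_usage(target_dir).free
    return free_bytes >= required_bytes


if __name__ == "__main__":
    # "." is a placeholder; pass the directory you plan to extract into.
    if has_room_for_corpus("."):
        print("Enough free space for the uncompressed corpus.")
    else:
        print("Less than 18 GiB free; choose another extraction directory.")
```

The check is deliberately conservative in one direction only: it tells you when extraction will certainly not fit, but leaves some headroom up to you.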
People
Students: Jakob Gomoll, Marie Bornemann, Lene Ganschow, Abdul-Hamid Sabri, Florian Kneist