Synopsis

This corpus is outdated. Please use its successor PAN-PC-11.

The PAN plagiarism corpus 2010 (PAN-PC-10) is a corpus for the evaluation of automatic plagiarism detection algorithms. For research purposes the corpus can be used free of charge.

Download

To download the corpus, use the following links (consider using a download manager):

All parts are required. Inflate only the first part; your archiver will inflate the other two parts automatically.

If you use the dataset in your research, please send us a copy of your publication. We kindly ask you to refer to the corpus via [bib].

You might also be interested in the following items:

Research

The PAN-PC-10 can be used to evaluate the following retrieval task:

  • Plagiarism Detection. Given a set of suspicious documents and a set of source documents, the task is to find all plagiarized sections in the suspicious documents and, if available, the corresponding source section.
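As a rough illustration of the task, the following sketch flags positions in a suspicious document whose word n-grams also occur in a source document. It is a toy baseline under stated assumptions, not a method prescribed by the corpus: the function names, the whitespace tokenization, and the choice of n = 3 are all hypothetical, and the example texts are made up rather than taken from the PAN-PC-10.

```python
def word_ngrams(text, n=3):
    """Return the set of lowercase word n-grams in `text` (whitespace tokenization)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def find_shared_ngrams(suspicious, source, n=3):
    """Return word offsets in `suspicious` where an n-gram also occurs in `source`."""
    source_grams = word_ngrams(source, n)
    words = suspicious.lower().split()
    return [
        i
        for i in range(len(words) - n + 1)
        if tuple(words[i:i + n]) in source_grams
    ]


# Made-up example texts, not corpus documents.
source = "the quick brown fox jumps over the lazy dog"
suspicious = "we saw that the quick brown fox jumps away"

hits = find_shared_ngrams(suspicious, source)
print(hits)  # word offsets in the suspicious text that start a shared 3-gram
```

Adjacent hits could then be merged into a single candidate section and paired with the matching source offsets. Exact n-gram matching like this would likely miss obfuscated or paraphrased passages.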

The PAN-PC-10 contains documents in which artificial plagiarism has been inserted automatically as well as documents in which simulated plagiarism has been inserted manually. The former have been constructed using a so-called random plagiarist, a computer program which constructs plagiarism according to a number of parameters, while the latter have been obtained with crowdsourcing via Amazon's Mechanical Turk.
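To make the idea of automatically inserted plagiarism concrete, here is a minimal sketch of a program that copies a passage from a source document into a suspicious document and records where it went. It is not the random plagiarist used to build the corpus: the parameters shown (`length`, `shuffle`) and the annotation fields are assumptions chosen for illustration, and the actual parameters are described in the associated publication.

```python
import random


def insert_artificial_plagiarism(suspicious_words, source_words, length, shuffle, rng):
    """Copy `length` words from the source into the suspicious document.

    Returns the modified word list and an annotation dict with the
    insertion offset, the source offset, and the passage length.
    """
    # Pick a random passage from the source document.
    start = rng.randrange(len(source_words) - length + 1)
    passage = source_words[start:start + length]

    # Hypothetical obfuscation step: shuffle the word order of the passage.
    if shuffle:
        passage = passage[:]
        rng.shuffle(passage)

    # Insert the passage at a random word position in the suspicious document.
    pos = rng.randrange(len(suspicious_words) + 1)
    result = suspicious_words[:pos] + passage + suspicious_words[pos:]
    annotation = {"suspicious_offset": pos, "source_offset": start, "length": length}
    return result, annotation


# Made-up example texts, not corpus documents.
rng = random.Random(0)
source_words = "a b c d e f g h i j".split()
suspicious_words = "one two three four five".split()

result, annotation = insert_artificial_plagiarism(
    suspicious_words, source_words, length=4, shuffle=True, rng=rng
)
print(result, annotation)
```

Recording the offsets at insertion time is what would let a detector's output be scored against known passage locations.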

A detailed description of the corpus construction can be found in the associated publication.

People

Students: Andreas Eiselt

Publications