Synopsis

This corpus is outdated. Please use its successor PAN-PC-11.

The PAN plagiarism corpus 2010 (PAN-PC-10) is a corpus for the evaluation of automatic plagiarism detection algorithms. For research purposes the corpus can be used free of charge.

Download

You can access the PAN-PC-10 corpus on Zenodo.

If you use the dataset in your research, please send us a copy of your publication. We kindly ask you to cite the corpus via [bib]. If you additionally want to link to the dataset, please use the dataset's [doi] for a stable link.

You might also be interested in the following items:

Research

The PAN-PC-10 can be used to evaluate the following retrieval task:

  • Plagiarism Detection. Given a set of suspicious documents and a set of source documents, the task is to find all plagiarized sections in the suspicious documents and, if available, the corresponding source sections.
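To make the input and output of this task concrete, the following is a minimal sketch of a naive detector based on shared word n-grams. It is not the evaluation procedure or any baseline shipped with the corpus. The function names (word_ngrams, detect), the n-gram length of 4, and the representation of documents as plain strings keyed by an ID are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def word_ngrams(text: str, n: int) -> Dict[Tuple[str, ...], List[int]]:
    """Map each lowercased word n-gram to the word offsets where it starts."""
    words = text.lower().split()
    index: Dict[Tuple[str, ...], List[int]] = {}
    for i in range(len(words) - n + 1):
        index.setdefault(tuple(words[i:i + n]), []).append(i)
    return index


def detect(suspicious: str, sources: Dict[str, str], n: int = 4) -> List[Tuple[int, int, str]]:
    """Return (start_word, end_word, source_id) spans of the suspicious
    document whose word n-grams also occur in a source document.

    Hypothetical toy detector: end_word is exclusive, and overlapping or
    adjacent n-gram matches are merged into one span.
    """
    susp_words = suspicious.lower().split()
    detections: List[Tuple[int, int, str]] = []
    for source_id, source_text in sources.items():
        source_index = word_ngrams(source_text, n)
        start = None  # start of the currently open span, if any
        end = 0       # exclusive end of the currently open span
        for i in range(len(susp_words) - n + 1):
            if tuple(susp_words[i:i + n]) in source_index:
                if start is None:
                    start = i
                end = i + n
            elif start is not None and i >= end:
                # No match here and the open span is fully behind us: close it.
                detections.append((start, end, source_id))
                start = None
        if start is not None:
            detections.append((start, end, source_id))
    return detections
```

Verbatim matching of this kind only finds unobfuscated copies; it is meant to show the shape of the task (documents in, section offsets and source IDs out), not a competitive approach.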

The PAN-PC-10 contains documents in which artificial plagiarism has been inserted automatically as well as documents in which simulated plagiarism has been inserted manually. The former have been constructed using a so-called random plagiarist, a computer program which constructs plagiarism according to a number of parameters, while the latter have been obtained with crowdsourcing via Amazon's Mechanical Turk.
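The idea of inserting artificial plagiarism programmatically can be illustrated with a toy sketch. This is not the random plagiarist used to build the corpus; its actual parameters and obfuscation strategies are described in the associated publication. The function insert_plagiarism and its parameters (length, shuffle, seed) are hypothetical and chosen only for illustration.

```python
import random
from typing import Tuple


def insert_plagiarism(suspicious: str, source: str, length: int,
                      shuffle: bool, seed: int) -> Tuple[str, int, int]:
    """Copy a random passage of `length` words from `source` into a random
    position of `suspicious`.

    Returns the new text and the (start, end) word offsets of the inserted
    passage, with `end` exclusive. All parameters are illustrative
    assumptions, not those of the corpus construction.
    """
    rng = random.Random(seed)  # seeded so the result is reproducible
    src_words = source.split()
    susp_words = suspicious.split()

    length = min(length, len(src_words))
    src_start = rng.randint(0, len(src_words) - length)
    passage = src_words[src_start:src_start + length]

    if shuffle:
        # Crude stand-in for obfuscation: reorder the copied words.
        rng.shuffle(passage)

    pos = rng.randint(0, len(susp_words))
    new_words = susp_words[:pos] + passage + susp_words[pos:]
    return " ".join(new_words), pos, pos + length
```

Returning the offsets alongside the text mirrors the general idea that automatically constructed cases come with exact ground-truth annotations, which is what makes automatic evaluation possible.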

A detailed description of the corpus construction can be found in the associated publication.

People

Students: Andreas Eiselt

Publications