The PAN plagiarism corpus 2011 (PAN-PC-11) is a corpus for the evaluation of automatic plagiarism detection algorithms. For research purposes the corpus can be used free of charge.


To download the corpus, use the following links (consider using a download manager):

All parts are required. Extract only the first part; your archiver will extract the other two parts automatically.

If you use the dataset in your research, please send us a copy of your publication. We kindly ask you to refer to the corpus via [bib].

You might also be interested in the following items:


The PAN-PC-11 can be used to evaluate the following retrieval tasks:

  • External Plagiarism Detection. Given a set of suspicious documents and a set of source documents, the task is to find all plagiarized sections in the suspicious documents and their respective source sections in the source documents.
  • Intrinsic Plagiarism Detection. Given only a set of suspicious documents, the task is to identify all plagiarized sections, e.g., by detecting writing style breaches. The comparison of a suspicious document with other documents is not allowed in this task.
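To make the external detection task concrete, the following is a minimal sketch of one simple baseline idea: comparing a suspicious document against each source document by counting shared word n-grams. It is not the evaluation procedure of the corpus or any particular detector; the function names, the n-gram length `n`, and the `min_shared` threshold are assumptions chosen for illustration.

```python
def word_ngrams(text, n=3):
    """Return the set of lowercase word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def find_candidate_sources(suspicious, sources, n=3, min_shared=2):
    """Map each source id to its number of n-grams shared with the
    suspicious text, keeping only sources at or above min_shared.

    `sources` is a dict {source_id: text}. Both the dict layout and
    the threshold are illustrative assumptions, not part of the corpus.
    """
    suspicious_ngrams = word_ngrams(suspicious, n)
    hits = {}
    for source_id, text in sources.items():
        shared = len(suspicious_ngrams & word_ngrams(text, n))
        if shared >= min_shared:
            hits[source_id] = shared
    return hits


sources = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "completely unrelated text about cooking pasta at home",
}
suspicious = "yesterday the quick brown fox jumps away"
candidates = find_candidate_sources(suspicious, sources)
print(candidates)
```

A sketch like this only flags candidate source documents. The task as stated above additionally requires locating the plagiarized sections and their matching source sections, which needs an alignment step that is omitted here.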

The PAN-PC-11 contains documents in which plagiarism has been inserted automatically as well as documents in which plagiarism has been inserted manually. The former have been constructed using a so-called random plagiarist, a computer program which constructs plagiarism according to a number of parameters, while the latter have been obtained with crowdsourcing via Amazon's Mechanical Turk.
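The idea of automatic insertion can be illustrated with a small sketch. The actual random plagiarist and its parameters are described in the associated publication, not here, so everything below is an assumption for illustration: the function name, the single `length` parameter, the word-level offsets, and the annotation layout are all hypothetical.

```python
import random


def insert_random_passage(suspicious, source, length, rng):
    """Copy a random passage of `length` words from `source` into a
    random position in `suspicious`.

    Returns the modified text and a dict recording where the passage
    came from and where it was placed (word offsets). This is a
    hypothetical simplification, not the corpus construction program.
    """
    src_words = source.split()
    sus_words = suspicious.split()
    length = min(length, len(src_words))
    start = rng.randrange(len(src_words) - length + 1)
    passage = src_words[start:start + length]
    position = rng.randrange(len(sus_words) + 1)
    result = sus_words[:position] + passage + sus_words[position:]
    annotation = {
        "source_start": start,
        "suspicious_start": position,
        "length": length,
    }
    return " ".join(result), annotation


rng = random.Random(0)  # fixed seed so the example is reproducible
text, annotation = insert_random_passage("a b c d", "w x y z", 2, rng)
print(text, annotation)
```

Recording the annotation alongside the modified text is what makes such documents usable for evaluation: a detector's output can be compared against the known inserted passage.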

A detailed description of the corpus construction can be found in the associated publication.


Students: Andreas Eiselt