This corpus is outdated. Please use its successor PAN-PC-11.

The Webis plagiarism corpus 2008 (Webis-PC-08) is a corpus for evaluating automatic plagiarism detection algorithms. It can be used free of charge for research purposes. However, because the documents in the corpus are copyrighted, we need assurance that you have legal access to the ACM digital library.


To obtain this corpus, you need to prove that you have legal access to the ACM digital library, since the corpus comprises documents collected from there. Please ask your university library for a written and signed letter that verifies your access rights. Then contact the people mentioned below by mail and enclose the letter in order to be given access to the following file:

    (299 MB, MD5 sum: dd791566da7d031d1d155c78945ff2a2)
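
Once you have received the file, you may want to check it against the MD5 sum given above. The following is a minimal sketch of one way to do this in Python, not an official tool of the corpus. The function names `md5_of_file` and `matches_expected` are chosen here for illustration, and the path you pass in is whatever name the downloaded file has on your system.

```python
import hashlib

# MD5 sum as published on this page for the corpus file.
EXPECTED_MD5 = "dd791566da7d031d1d155c78945ff2a2"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`.

    The file is read in chunks so that a file of several hundred MB
    does not have to be loaded into memory at once.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals `expected` (case-insensitive)."""
    return md5_of_file(path) == expected.lower()
```

Calling `matches_expected` with the path of the downloaded file returns `True` if the digest agrees with the published sum. A result of `False` suggests an incomplete or corrupted download.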

If you use the dataset in your research, please send us a copy of your publication. We kindly ask you to cite the corpus via [bib].


Research on plagiarism detection currently lacks a reference collection of plagiarism cases that can serve as a yardstick for comparing different detection approaches. In our project we have set up a corpus of artificial plagiarism cases: 101 monographs were collected from the ACM digital library, and chunks of text from other monographs were manually inserted into each of them. This was done in various ways in order to simulate two kinds of plagiarism: accurate copies and modified copies. Cross-language plagiarism cases are currently not part of the corpus. The corpus is suited for evaluating external plagiarism detection algorithms as well as intrinsic plagiarism detection algorithms such as writing style analysis.


Students: Marion Kulig