This corpus is outdated. Please use its successors PAN-WVC-10 and PAN-WVC-11.

The Webis Wikipedia vandalism corpus (Webis-WVC-07) is a corpus for evaluating automatic vandalism detection algorithms for Wikipedia. The corpus can be used free of charge for research purposes.


To download the corpus, use the following link:

If you use the dataset in your research, please send us a copy of your publication. We kindly ask you to refer to the corpus via [bib].


As part of our research on automatic vandalism detection, we have compiled a corpus of vandalism cases found in Wikipedia. The corpus is the first standardized test collection for the comparison of vandalism detection algorithms. It comprises 940 edits, of which 301 are marked as vandalism by human evaluators. The corpus is based in part on the results of a study conducted by the Wikipedia community.
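
The class distribution implied by these figures matters when comparing detection algorithms, because it fixes the accuracy that a trivial classifier would reach. The following is a minimal sketch of that arithmetic, using only the two counts stated above. The function name `class_balance` and the notion of a "majority-class baseline" are illustrative assumptions and are not part of the corpus or its documentation.

```python
# Counts taken from the corpus description: 940 edits, 301 marked as vandalism.
TOTAL_EDITS = 940
VANDALISM_EDITS = 301


def class_balance(total: int, positives: int) -> tuple[float, float]:
    """Return (positive share, negative share) of a labeled collection.

    Hypothetical helper for illustration only.
    """
    if total <= 0 or not 0 <= positives <= total:
        raise ValueError("expected 0 <= positives <= total and total > 0")
    positive_share = positives / total
    return positive_share, 1.0 - positive_share


vandalism_share, regular_share = class_balance(TOTAL_EDITS, VANDALISM_EDITS)

# Accuracy of a trivial classifier that assigns every edit to the larger class.
majority_baseline_accuracy = max(vandalism_share, regular_share)

print(f"vandalism share: {vandalism_share:.1%}")                    # 32.0%
print(f"majority-class baseline: {majority_baseline_accuracy:.1%}")  # 68.0%
```

Under these assumptions, roughly one in three edits is vandalism, so a detector evaluated on this corpus would need to exceed about 68% accuracy to beat the trivial baseline. Precision and recall on the vandalism class are usually more informative than accuracy for this kind of imbalance.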