This corpus is outdated. Please use its successors, PAN-WVC-10 and PAN-WVC-11.

The Webis Wikipedia vandalism corpus (Webis-WVC-07) is a corpus for evaluating automatic vandalism detection algorithms for Wikipedia. The corpus can be used free of charge for research purposes.


You can access the Webis-WVC-07 corpus on Zenodo.

If you use the dataset in your research, please send us a copy of your publication. We kindly ask you to cite the corpus via [bib]. If you also want to link to the dataset, please use its [doi], which provides a stable link.


As part of our research on automatic vandalism detection, we have compiled a corpus of vandalism cases found in Wikipedia. The corpus is the first standardized test collection for comparing vandalism detection algorithms. It comprises 940 edits, of which 301 were marked as vandalism by human evaluators. The corpus is based in part on the results of a study conducted by the Wikipedia community.