The PAN Wikipedia Quality Flaw Corpus 2012, PAN-WQF-12, provides human-labeled English Wikipedia articles that contain specific quality flaws.

The corpus is intended to train and evaluate automated approaches for the prediction of quality flaws in Wikipedia; for more information, refer to Anderka et al. (2012).

A subset of the corpus was used in the 1st Competition on Quality Flaw Prediction in Wikipedia, held in conjunction with the PAN 2012 evaluation lab at the CLEF 2012 conference. Details about the competition can be found in Anderka and Stein (2012).


You can access the PAN-WQF-12 corpus on Zenodo.

If you use the dataset in your research, please send us a copy of your publication. We kindly ask you to cite the corpus via [bib]. If you additionally want to link to the dataset, please use the dataset's [doi] for a stable link.


The corpus comprises 1,592,226 articles extracted from the English Wikipedia snapshot of January 4, 2012. A subset of 208,228 articles is labeled with ten specific quality flaws, which are listed in the following table. The labeling is based on human-defined cleanup tags. The remaining 1,383,998 articles have not been tagged with any cleanup tag.

Flaw Name          Flaw Description                                                                Articles
Advert             The article is written like an advertisement.                                      2,217
Empty section      The article has at least one section that is empty.                               11,514
No footnotes       The article’s sources remain unclear because of insufficient inline citations.    12,136
Notability         The article does not meet the general notability guideline.                        6,299
Original research  The article contains original research.                                            1,014
Orphan             The article has fewer than three incoming links.                                  42,712
Primary sources    The article relies on references to primary sources.                               7,363
Refimprove         The article needs additional citations for verification.                          46,288
Unreferenced       The article does not cite any references or sources.                              75,144
Wikify             The article needs to be wikified (internal links and layout).                      3,541

Distribution of tagged articles
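
The figures above can be cross-checked with a few lines of Python. The following is a minimal sketch that uses only the counts printed in the table and in the preceding paragraph; it does not read the corpus files, whose layout is described in the enclosed readme. The names `FLAW_COUNTS`, `flaw_shares`, and `imbalance_ratio` are illustrative and not part of the corpus.

```python
# Article counts per flaw, copied from the table above.
FLAW_COUNTS = {
    "Advert": 2_217,
    "Empty section": 11_514,
    "No footnotes": 12_136,
    "Notability": 6_299,
    "Original research": 1_014,
    "Orphan": 42_712,
    "Primary sources": 7_363,
    "Refimprove": 46_288,
    "Unreferenced": 75_144,
    "Wikify": 3_541,
}

# Totals stated in the corpus description.
TAGGED_TOTAL = 208_228
UNTAGGED_TOTAL = 1_383_998
CORPUS_TOTAL = 1_592_226


def flaw_shares(counts):
    """Return each flaw's share of the summed per-flaw counts."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def imbalance_ratio(counts):
    """Return the ratio between the largest and the smallest flaw class."""
    return max(counts.values()) / min(counts.values())


def main():
    # Consistency checks against the totals given in the text.
    assert sum(FLAW_COUNTS.values()) == TAGGED_TOTAL
    assert TAGGED_TOTAL + UNTAGGED_TOTAL == CORPUS_TOTAL

    shares = flaw_shares(FLAW_COUNTS)
    # Print the flaws from most to least frequent.
    for name, n in sorted(FLAW_COUNTS.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:<18}{n:>8,}{shares[name]:>8.1%}")
    print(f"largest/smallest class ratio: {imbalance_ratio(FLAW_COUNTS):.1f}")


if __name__ == "__main__":
    main()
```

The per-flaw counts add up to the stated 208,228 tagged articles, and the tagged and untagged subsets together give the full 1,592,226. The class sizes are strongly skewed (Unreferenced is roughly 74 times as large as Original research), which is worth keeping in mind when sampling training data or choosing evaluation measures.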

For more details about the corpus, refer to the readme file enclosed in the archive.


Students: Michael Völske