The PAN Wikipedia Quality Flaw Corpus 2012, PAN-WQF-12, provides human-labeled English Wikipedia articles that contain specific quality flaws.
The corpus is intended to train and evaluate automated approaches for the prediction of quality flaws in Wikipedia; for more information, refer to Anderka et al. (2012).
A subset of the corpus was used in the 1st Competition on Quality Flaw Prediction in Wikipedia, held in conjunction with the PAN 2012 evaluation lab at the CLEF 2012 conference. Details about the competition can be found in Anderka and Stein (2012).
To download the corpus, use the following link:
(3.8 GB, MD5 sum: 3d3ec4d71c707def537e7225169201df). [readme]
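Because the archive is large, it is worth checking the download against the MD5 sum given above before unpacking it. The following is a minimal sketch of such a check; the archive file name used in the comment is a placeholder (an assumption, not the actual file name), and the function names are illustrative only.

```python
import hashlib

# MD5 sum as published on this page.
PUBLISHED_MD5 = "3d3ec4d71c707def537e7225169201df"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that a multi-gigabyte archive never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path, expected=PUBLISHED_MD5):
    """Compare a file's MD5 digest with the expected value (case-insensitive)."""
    return md5_of_file(path) == expected.lower()


# Hypothetical usage; replace the placeholder with the real path of the download:
#   matches_published_md5("path/to/downloaded-archive")
```

A `False` result most likely indicates an incomplete or corrupted download, in which case the archive should be fetched again.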
For research purposes the corpus can be used free of charge. If you use the dataset in your research, please send us a copy of your publication. We kindly ask you to refer to the corpus via [bib].
The corpus comprises 1,592,226 articles extracted from the English Wikipedia snapshot from January 4th, 2012. A subset of 208,228 articles is labeled with ten specific quality flaws, which are listed in the following table. The labeling is based on human-defined cleanup tags. In addition, the corpus comprises 1,383,998 articles that have not been tagged with any cleanup tag.
|Flaw Name||Flaw Description||Articles|
|Advert||The article is written like an advertisement.||2,217|
|Empty section||The article has at least one section that is empty.||11,514|
|No footnotes||The article’s sources remain unclear because it lacks inline citations.||12,136|
|Notability||The article does not meet the general notability guideline.||6,299|
|Original research||The article contains original research.||1,014|
|Orphan||The article has fewer than three incoming links.||42,712|
|Primary sources||The article relies on references to primary sources.||7,363|
|Refimprove||The article needs additional citations for verification.||46,288|
|Unreferenced||The article does not cite any references or sources.||75,144|
|Wikify||The article needs to be wikified (internal links and layout).||3,541|
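The figures in the table can be cross-checked against the totals stated above: the ten per-flaw counts add up to the 208,228 labeled articles, and together with the 1,383,998 untagged articles they give the full 1,592,226. The sketch below reproduces this arithmetic and computes each flaw's share of the labeled subset; the variable and function names are illustrative and do not come from the corpus or its readme.

```python
# Article counts per flaw, copied from the table above.
FLAW_COUNTS = {
    "Advert": 2_217,
    "Empty section": 11_514,
    "No footnotes": 12_136,
    "Notability": 6_299,
    "Original research": 1_014,
    "Orphan": 42_712,
    "Primary sources": 7_363,
    "Refimprove": 46_288,
    "Unreferenced": 75_144,
    "Wikify": 3_541,
}

# Totals as stated in the corpus description.
LABELED_TOTAL = 208_228
UNTAGGED_TOTAL = 1_383_998
CORPUS_TOTAL = 1_592_226


def flaw_shares(counts):
    """Return each flaw's fraction of all labeled articles."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


labeled = sum(FLAW_COUNTS.values())
print("labeled articles:", labeled, "matches stated total:", labeled == LABELED_TOTAL)
print("corpus size:", labeled + UNTAGGED_TOTAL,
      "matches stated total:", labeled + UNTAGGED_TOTAL == CORPUS_TOTAL)

# List the flaws from most to least frequent.
for name, share in sorted(flaw_shares(FLAW_COUNTS).items(), key=lambda kv: -kv[1]):
    print(f"{name:<18} {share:6.1%}")
```

The counts are strongly imbalanced, ranging from 1,014 articles (Original research) to 75,144 (Unreferenced), which is worth keeping in mind when sampling training data or choosing evaluation measures.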
For more details about the corpus, refer to the readme file enclosed in the archive (readme).
Students: Michael Völske