Paderborn Genre Analysis Corpus 2012


The Paderborn Genre Analysis 2012 corpus (PaGA-12) contains 1,639 HTML documents of 26 genres. All documents were collected from 2009-10-18 to 2009-11-20, and each document is manually assigned to exactly one genre. For each genre, the corpus provides at least 50 documents.


You can access the PaGA-12 corpus on Zenodo.

If you use the dataset in your research, please send us a copy of your publication. To link to the dataset, please use the dataset's [doi], which provides a stable link.


All HTML documents contain German text only, and framesets have been removed. The corpus is delivered as a MySQL database dump; the database structure is detailed in a README file delivered with the corpus.
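
After importing the dump, it can be useful to confirm that the data matches the figures stated above (1,639 documents, 26 genres, exactly one genre per document, at least 50 documents per genre). The following is a minimal sketch of such a check, not part of the corpus itself. It assumes you have already queried the database and hold the result as (document id, genre) pairs; the function name `check_corpus` and the pair layout are hypothetical, and the actual table and column names must be taken from the README.

```python
from collections import Counter


def check_corpus(rows, expected_docs=1639, expected_genres=26, min_per_genre=50):
    """Return a list of human-readable problems; an empty list means all checks passed.

    `rows` is an iterable of (document_id, genre) pairs. This layout is an
    assumption for illustration; the real schema is described in the README.
    """
    rows = list(rows)
    doc_ids = [doc_id for doc_id, _ in rows]
    per_genre = Counter(genre for _, genre in rows)
    problems = []

    # Each document should carry exactly one genre label,
    # so no document id may appear in more than one row.
    if len(set(doc_ids)) != len(doc_ids):
        problems.append("at least one document is assigned to more than one genre")

    if len(set(doc_ids)) != expected_docs:
        problems.append(f"expected {expected_docs} documents, found {len(set(doc_ids))}")

    if len(per_genre) != expected_genres:
        problems.append(f"expected {expected_genres} genres, found {len(per_genre)}")

    # Report every genre that falls below the stated minimum size.
    for genre, count in sorted(per_genre.items()):
        if count < min_per_genre:
            problems.append(f"genre {genre!r} has only {count} documents")

    return problems
```

Calling `check_corpus(rows)` on the pairs read from your local database returns the list of deviations; any entry in that list suggests an incomplete import or a query that does not match the schema in the README.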

