The Paderborn Genre Analysis 2012 corpus (PaGA-12) contains 1,639 HTML documents covering 26 genres. The documents were collected between 2009-10-18 and 2009-11-20, and each was manually assigned to exactly one genre. The corpus provides at least 50 documents per genre.
To download the corpus, use the following link:
(20.7 MB, MD5 sum: c9652cb695e35698fe0459a2bc6b6aa3)
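After downloading, the archive can be checked against the MD5 sum given above. The following is a minimal sketch in Python using only the standard library; the function names `md5_of_file` and `verify_archive` are illustrative and not part of the corpus distribution, and the path of the downloaded file is whatever you saved it as.

```python
import hashlib

# MD5 sum published on this page for the corpus archive.
EXPECTED_MD5 = "c9652cb695e35698fe0459a2bc6b6aa3"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`.

    The file is read in chunks so that the whole archive does not
    have to be held in memory at once.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches `expected`."""
    return md5_of_file(path) == expected.strip().lower()
```

Calling `verify_archive` with the path of the downloaded file returns `True` when the digest matches. A mismatch usually indicates an incomplete or corrupted download, in which case downloading the archive again is the simplest remedy.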
If you use the dataset in your research, please send us a copy of your publication. We kindly ask you to refer to the corpus as follows:
- Michael Baumann, Theodor Lettmann and Benno Stein. Paderborn Genre Analysis Corpus 2012 (PaGa-12). http://www.uni-weimar.de/medien/webis/corpora, 2012. [corpus]
All HTML documents contain German text only, and framesets have been removed. The corpus is delivered as a MySQL database dump; the database structure is described in a README file included with the corpus.