The web genre corpus 2004 (Genre-KI-04) is designed for evaluating genre classification techniques. It consists of 1239 web documents classified into 8 genres, together with basic metadata for each file.
To download the corpus, use the following link:
(11.0 MB, MD5 sum: c49a30e62019f90ebd1d28f7f62f9bac)
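The MD5 sum above can be used to check that the download arrived intact. The following is a minimal sketch in Python using only the standard library. The function names `md5_of_file` and `verify_archive` are illustrative and not part of the corpus distribution, and the path of the downloaded archive is whatever you saved it as.

```python
import hashlib

# MD5 sum published for the corpus archive (copied from the line above).
EXPECTED_MD5 = "c49a30e62019f90ebd1d28f7f62f9bac"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`.

    The file is read in chunks so that the whole archive does not
    have to fit in memory at once.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches `expected`."""
    return md5_of_file(path) == expected.lower()
```

Calling `verify_archive` with the path of the saved archive returns `True` when the file matches the published sum. MD5 is adequate here for detecting a corrupted or truncated download, not for protecting against deliberate tampering.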
If you use the dataset in your research, please send us a copy of your publication. We kindly ask you to refer to the corpus via [bib].
The corpus consists of HTML documents grouped into directories by genre. The first lines of each document contain its meta information in an HTML comment. This information includes the URL the document was downloaded from, the document title, and the parsed text.
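Given this layout, the per-genre document counts and the meta information block can be read with a few lines of Python. This is a rough sketch under stated assumptions: each immediate subdirectory of the unpacked corpus is taken to be one genre, and the meta information is taken to be the first HTML comment at the very start of a file. The internal format of that comment is not described here, so the sketch returns it as raw text and does not try to split it into fields. The function names are illustrative.

```python
import re
from collections import Counter
from pathlib import Path

# Assumption: the meta information is the first HTML comment in the file,
# preceded by nothing but optional whitespace.
LEADING_COMMENT = re.compile(r"\A\s*<!--(.*?)-->", re.DOTALL)


def count_documents_by_genre(corpus_root):
    """Count the files in each immediate subdirectory of `corpus_root`.

    Assumption: every subdirectory corresponds to one genre.
    """
    counts = Counter()
    for genre_dir in Path(corpus_root).iterdir():
        if genre_dir.is_dir():
            counts[genre_dir.name] = sum(
                1 for entry in genre_dir.iterdir() if entry.is_file()
            )
    return counts


def read_leading_comment(html_text):
    """Return the body of the leading HTML comment, or None if absent."""
    match = LEADING_COMMENT.match(html_text)
    return match.group(1).strip() if match else None
```

Because the documents were collected from the web, their character encodings may vary. Reading each file with `open(path, encoding="utf-8", errors="replace")` before calling `read_leading_comment` is one cautious option.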
|Genre||Number of Documents||Share of Corpus|
|Portrait (non private)||179||14.4%|