The Web Genre Corpus 2004 (Genre-KI-04) is designed for evaluating genre classification techniques. It consists of 1239 web documents classified into 8 genres, together with basic metadata for each file.


To download the corpus, use the following link:

If you use the dataset in your research, please send us a copy of your publication. We kindly ask you to refer to the corpus via [bib].


The corpus consists of HTML documents grouped into directories according to their genre. The first lines of each document contain its meta information in an HTML comment. This information includes the URL the document was downloaded from, the document title, and the parsed text.
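As a rough illustration of how this layout might be read, the following Python sketch walks the genre directories and pulls out the leading HTML comment of each document. The function names, the `root/<genre>/<document>` layout, and the UTF-8 encoding are assumptions made for this example; the internal format of the comment is not specified here, so the sketch returns it as raw text rather than parsing individual fields.

```python
import re
from collections import Counter
from pathlib import Path
from typing import Iterator, Optional, Tuple

# Matches the first HTML comment in a document (non-greedy, may span lines).
FIRST_COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)


def leading_comment(html: str) -> Optional[str]:
    """Return the body of the first HTML comment, or None if there is none."""
    match = FIRST_COMMENT.search(html)
    return match.group(1).strip() if match else None


def iter_corpus(root) -> Iterator[Tuple[str, Path, Optional[str]]]:
    """Yield (genre, path, raw_meta_comment) for every file below root.

    Assumes the layout root/<genre>/<document>; the directory name is
    taken as the genre label.
    """
    for genre_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for doc in sorted(p for p in genre_dir.iterdir() if p.is_file()):
            # The encoding is an assumption; errors="replace" keeps the
            # loop from aborting on bytes that are not valid UTF-8.
            text = doc.read_text(encoding="utf-8", errors="replace")
            yield genre_dir.name, doc, leading_comment(text)


def genre_counts(root) -> Counter:
    """Count documents per genre directory."""
    return Counter(genre for genre, _, _ in iter_corpus(root))
```

Calling `genre_counts` on the unpacked corpus directory should give one entry per genre directory, which can be compared against the document counts published for the corpus.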

A definition of the genres can be found in the paper [doi] [pdf] [bib] or in the corpus. The distribution of the documents among the genres is summarized in the table below.

Genre                     Number of Documents    Share
Articles                                  127    10.2%
Discussion                                127    10.3%
Download                                  152    12.3%
Help                                      140    11.3%
Link lists                                208    16.8%
Portrait (non private)                    179    14.4%
Portrait (private)                        131    10.6%
Shop                                      175    14.1%
Sum                                      1239     100%

Distribution of genres
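The shares can be recomputed from the document counts with a few lines of Python. This is only a sketch: the variable names are chosen for illustration, and the counts are copied from the distribution of genres given here. Note that rounding each share independently yields 10.3% for both genres with 127 documents; the published 10.2% for Articles was presumably adjusted so that the column sums to exactly 100%.

```python
# Document counts per genre, copied from the published distribution.
counts = {
    "Articles": 127,
    "Discussion": 127,
    "Download": 152,
    "Help": 140,
    "Link lists": 208,
    "Portrait (non private)": 179,
    "Portrait (private)": 131,
    "Shop": 175,
}

total = sum(counts.values())

# Share of each genre in percent, kept unrounded.
shares = {genre: 100 * n / total for genre, n in counts.items()}

for genre, n in counts.items():
    # Rounding to one decimal happens only when formatting the output.
    print(f"{genre:<24}{n:>6}{shares[genre]:>8.1f}%")
print(f"{'Sum':<24}{total:>6}")
```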