WEGA (Web Genre Analysis) is a technology for enriching Internet search results with genre information. For each snippet in a result list, WEGA analyzes the genre of the underlying document, that is, its type, purpose, or target group, and labels the snippet as <discussion page>, <article>, <online shop>, <download site>, <private homepage>, <commercial homepage>, or <help site>. Since genre is widely accepted as a positive or negative filtering criterion, these labels make it easier to find the most relevant results.
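As an illustration of genre as a positive or negative filtering criterion, the following minimal sketch filters an already labeled result list. The `Snippet` type, the `filter_by_genre` function, and the example URLs are hypothetical names introduced here; they are not part of WEGA.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Snippet:
    """A search result snippet with a genre label (hypothetical structure)."""
    title: str
    url: str
    genre: str


def filter_by_genre(
    snippets: Iterable[Snippet],
    include: Optional[Iterable[str]] = None,
    exclude: Optional[Iterable[str]] = None,
) -> List[Snippet]:
    """Keep snippets whose genre is in `include` (positive filter, if given)
    and not in `exclude` (negative filter). Result order is preserved."""
    wanted = set(include) if include is not None else None
    unwanted = set(exclude) if exclude is not None else set()
    return [
        s for s in snippets
        if (wanted is None or s.genre in wanted) and s.genre not in unwanted
    ]


# Made-up example data for illustration only.
results = [
    Snippet("Camera forum thread", "https://example.org/forum", "discussion page"),
    Snippet("Buy cameras online", "https://example.org/shop", "online shop"),
    Snippet("Camera manual FAQ", "https://example.org/help", "help site"),
]

no_shops = filter_by_genre(results, exclude=["online shop"])
only_help = filter_by_genre(results, include=["help site"])
```

Here `exclude` acts as the negative criterion (hide shops) and `include` as the positive one (show only help sites).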
The WEGA project addresses the following challenges:
- Conceptual Genre Palette. WEGA aims at helping information seekers, and hence the genre palette should address a "typical" user's information needs. Based on a user study, we chose to support the following genres: <discussion page>, <online shop>, <download site>, <private homepage>, <commercial homepage>, and <help site>.
- Novel Retrieval Models for Genre Classification. Retrieval models for Web genre classification have been proposed since 2000. These models are based on HTML tag statistics, linguistic analyses, simple text statistics, and manually compiled word lists. However, linguistic features are often expensive to compute, and hypotheses learned from HTML tags do not generalize well because a training corpus can hardly capture the diversity of the Web. WEGA addresses these issues with a new retrieval model that is based on the analysis of core vocabulary distributions. Our genre retrieval model allows for efficient feature computation while still providing acceptable classification performance.
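
The idea of cheap, vocabulary-based features can be sketched as follows. This is only an illustration under strong assumptions: the text above does not specify how WEGA defines or measures core vocabulary distributions, so the word lists, the function names, and the "share of tokens per vocabulary" feature are all hypothetical, and the argmax in `guess_genre` is a stand-in for a trained classifier.

```python
import re
from collections import Counter
from typing import Dict, List, Set

# Hypothetical genre-specific core vocabularies (invented for this sketch).
CORE_VOCABULARIES: Dict[str, Set[str]] = {
    "online shop": {"cart", "price", "shipping", "order", "checkout"},
    "discussion page": {"reply", "posted", "thread", "quote", "forum"},
    "help site": {"faq", "troubleshooting", "question", "answer", "howto"},
}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def core_vocabulary_features(
    text: str, vocabularies: Dict[str, Set[str]] = CORE_VOCABULARIES
) -> Dict[str, float]:
    """For each genre, return the share of tokens that fall into its core
    vocabulary. One pass over the tokens plus set lookups keeps this cheap."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    return {
        genre: (sum(counts[w] for w in vocab) / total if total else 0.0)
        for genre, vocab in vocabularies.items()
    }


def guess_genre(text: str) -> str:
    """Stand-in decision rule: pick the genre with the largest share.
    A real system would feed the features to a trained classifier."""
    features = core_vocabulary_features(text)
    return max(features, key=features.get)


page = "Add to cart and checkout; shipping price shown on order"
features = core_vocabulary_features(page)
```

Unlike part-of-speech or parse-based features, this computation needs only tokenization and counting, which is the kind of efficiency the paragraph above refers to.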
Students: Martin Kausche, Hagen-Christian Tönnies, and David Wiesner