Synopsis

WEGA (Web Genre Analysis) is a technology for enriching Internet search results with genre information. For each snippet in a result list, WEGA analyzes the genre of the underlying document, that is, its type, purpose, or target group, and labels the snippet as <discussion page>, <article>, <online shop>, <download site>, <private homepage>, <commercial homepage>, or <help site>. Since genre information is generally accepted as a positive or negative filtering criterion, it simplifies finding the most relevant results.

Research

The WEGA project addresses the following challenges:

  • Conceptual Genre Palette. WEGA aims at helping information seekers, and hence the genre palette should address a "typical" user's information needs. Based on a user study, we chose to support the following genres: <discussion page>, <online shop>, <download site>, <private homepage>, <commercial homepage>, and <help site>.
  • Novel Retrieval Models for Genre Classification. Retrieval models for Web genre classification have been proposed since the year 2000. These models are based on HTML tag statistics, linguistic analyses, simple text statistics, and manually compiled word lists. However, the linguistic statistics are often expensive to compute, and hypotheses learned from HTML tags do not generalize well because a training corpus can hardly reflect the diversity of the Web. WEGA addresses these issues with a new retrieval model that is based on the analysis of core vocabulary distributions. Our genre retrieval model allows for efficient feature computation while still providing acceptable classification performance.
  • Development of a Firefox Add-On. The current WEGA prototype is implemented as an Add-On for the popular Firefox browser and labels Google search result lists. Unlike previous versions, it needs no additional server technology: each document in a result list is loaded in the browser and analyzed in the background with JavaScript. Document download, analysis, and labeling happen asynchronously and do not interrupt the user.
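
To illustrate what features based on core vocabulary distributions might look like, here is a minimal Python sketch. It is not the WEGA implementation: the word lists in CORE_VOCAB, the function names, the split into four segments, and the "highest coverage wins" rule in guess_genre are all assumptions made for illustration. The idea shown is only that counting hits from a small genre-specific word list, and recording where in the document they fall, is cheap compared with a full linguistic analysis.

```python
import re

# Hypothetical core vocabularies for three of the genres named above.
# The actual WEGA term lists are not given in the text.
CORE_VOCAB = {
    "online shop": {"cart", "price", "shipping", "order", "checkout"},
    "discussion page": {"reply", "thread", "posted", "forum", "quote"},
    "download site": {"download", "version", "installer", "mirror", "release"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def vocab_features(text, vocab, segments=4):
    """Return (coverage, distribution) for one core vocabulary.

    coverage:     share of all tokens that belong to the vocabulary
    distribution: share of the vocabulary hits that fall into each of
                  `segments` equally sized parts of the document
    """
    tokens = tokenize(text)
    if not tokens:
        return 0.0, [0.0] * segments
    hits_per_segment = [0] * segments
    for i, token in enumerate(tokens):
        if token in vocab:
            # Map the token position to a segment index in [0, segments).
            hits_per_segment[i * segments // len(tokens)] += 1
    total_hits = sum(hits_per_segment)
    coverage = total_hits / len(tokens)
    distribution = [h / total_hits if total_hits else 0.0
                    for h in hits_per_segment]
    return coverage, distribution


def guess_genre(text):
    """Pick the genre with the highest coverage.

    This is a stand-in for a trained classifier; with no hits at all
    it simply returns the first genre in CORE_VOCAB.
    """
    scores = {genre: vocab_features(text, vocab)[0]
              for genre, vocab in CORE_VOCAB.items()}
    return max(scores, key=scores.get)
```

In practice the coverage and distribution values would be passed as a feature vector to a learned classifier rather than compared directly. Each document needs only one pass over its tokens per vocabulary.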

People

Students: Martin Kausche, Hagen-Christian Tönnies, and David Wiesner

Publications