Wikipedia Fingerprinting

Synopsis

In this project we develop and implement a new indexing technology which allows us to use complete (and possibly very large) documents as queries, while having a retrieval performance comparable to a standard term query. Our approach provides a powerful technology for retrieval tasks such as plagiarism analysis, near-duplicate detection, and high similarity search. To demonstrate the performance of our technology we have compiled the search index "Wikipedia in the Pocket", which contains about 1.5 million English Wikipedia articles and 0.5 million German articles. This index - along with a search interface and a servlet engine - fits on a conventional CD (0.7 gigabyte).

Research

The ingredients of our indexing technology are similarity hashing and minimal perfect hashing. With respect to the former we have developed a new kind of tolerant ("fuzzy") hashcodes for larger portions of text (50-1000 words). Fuzziness is introduced by generalizing divergence properties of a document's terms as a small set of parameters, which in turn are encoded as - what is here called - fingerprint. The probability that the fingerprints of two documents are equal is close to one for documents with a high cosine similarity.

Fuzzy hashcodes of indexed documents are mapped to memory positions, which point to a postlist referencing all documents associated with the hashcode in question. An optimal solution for this problem is a hash function which (i) maps all keys without hash collisions to the available memory positions, and (ii) has an upper bound for the average number of additional positions needed to ensure the first property. A hash function with the first property is called perfect; moreover, it is called minimal if the second property's upper bound is a small constant. For a previously known set of keys, minimal perfect hash functions can be constructed in linear time and space. This prerequisite is fulfilled for Wikipedia in the Pocket because all documents to be indexed are known in advance. We call this a half-open retrieval situation.

People

Students: Dennis Hoppe, Martin Trenkmann

Publications