Wikipedia Fingerprinting
Synopsis
This project focuses on effective indexing technology for retrieval tasks with an inherently quadratic nature, such as plagiarism analysis, near-duplicate detection, and high-similarity search. To demonstrate our technology, we have compiled the search index 'Wikipedia in the Pocket', which allows long text queries (even full-text queries) against Wikipedia. The index covers about 1.5 million English Wikipedia articles and 0.5 million German articles, and it fits, along with a search interface and a servlet engine, on a conventional CD (0.7 gigabytes).
The building blocks of our indexing technology are similarity hashing and minimal perfect hashing. Regarding the former, we have proposed a new kind of tolerant (fuzzy) hashcode for larger portions of text (50–1000 words). Fuzziness is introduced by generalizing distribution properties of the document terms into a small set of parameters, which in turn are encoded as what we call a fingerprint. The probability that the fingerprints of two documents are equal is close to one for documents with a high cosine similarity.

The fuzzy hashcodes of the indexed documents are mapped to memory positions, each of which points to a postlist referencing all documents associated with the hashcode in question. An optimal solution to this mapping problem is a hash function that (1) maps all keys to the available memory positions without collisions and (2) has an upper bound on the average number of additional positions needed to ensure the first property. Hash functions with the first property are called 'perfect'; they are called 'minimal' if the second property's upper bound is a small constant. For a previously known set of keys, minimal perfect hash functions can be constructed in linear time and space; for Wikipedia, this prerequisite is fulfilled.
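The following Java sketch illustrates the idea under simplifying assumptions: terms are grouped into first-letter classes, and the fingerprint records, per class, whether the class occurs at least as often as it would under a uniform distribution. The number of classes, the class assignment, and the quantization threshold are illustrative choices rather than the project's actual parameterization, and a plain HashMap stands in for the minimal perfect hash function that maps the known set of fingerprints to postlist positions.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FuzzyFingerprintSketch {

    // Number of term classes; the value is an illustrative assumption.
    private static final int NUM_CLASSES = 8;

    // Assign a term to a class by its first letter (a stand-in for the
    // distribution properties the fingerprint generalizes).
    private static int termClass(String term) {
        return Math.floorMod(Character.toLowerCase(term.charAt(0)) - 'a', NUM_CLASSES);
    }

    // Fuzzy hashcode: one bit per class, set if the class occurs at least as often
    // as under a uniform distribution. The coarse quantization is what makes the
    // code tolerant: highly similar documents tend to receive identical codes.
    static int fingerprint(List<String> terms) {
        int[] counts = new int[NUM_CLASSES];
        for (String t : terms) counts[termClass(t)]++;
        int code = 0;
        for (int c : counts) {
            int bucket = (c * NUM_CLASSES >= terms.size()) ? 1 : 0;
            code = (code << 1) | bucket;
        }
        return code;
    }

    public static void main(String[] args) {
        // Postlist table: fingerprint -> documents sharing that fuzzy hashcode.
        // A HashMap stands in for the minimal perfect hash function of the real index.
        Map<Integer, List<String>> postlists = new HashMap<>();

        Map<String, List<String>> corpus = new HashMap<>();
        corpus.put("doc1", Arrays.asList("similarity hashing maps similar documents to equal codes".split(" ")));
        corpus.put("doc2", Arrays.asList("similarity hashing maps very similar documents to equal hash codes".split(" ")));
        corpus.put("doc3", Arrays.asList("an entirely unrelated article about alpine botany and glaciers".split(" ")));

        corpus.forEach((id, terms) ->
                postlists.computeIfAbsent(fingerprint(terms), k -> new ArrayList<>()).add(id));

        // A near-duplicate query reaches its postlist in a single lookup.
        List<String> query = Arrays.asList(
                "similarity hashing maps similar documents to equal hash codes".split(" "));
        System.out.println(postlists.getOrDefault(fingerprint(query), List.of()));
    }
}

In this toy run the query, a near-duplicate of doc1, hashes to the same fingerprint as doc1 and doc2 and retrieves their shared postlist, while the unrelated doc3 ends up under a different fingerprint.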