The mapping between people and their names is not one-to-one. When searching the World Wide Web for a person, one is immediately faced with this problem: search results contain Web pages of different individuals who share the same name. Since an individual can have several Web pages, search results become cluttered, especially when searching for people with rather common names. The problem can only be addressed with a deeper semantic analysis: Web page content must be interpreted in multiple respects in order to distinguish between different individuals even if they have the same name. This grouping problem is also referred to as "multi-document person resolution"; it has recently gained much attention, not least through the Spock Challenge. The solution outlined below won this challenge.
The heart of our solution to the person resolution problem is similarity graph clustering. Let G = ⟨V, E, φ⟩ denote a weighted graph with node set V, edge set E, and edge weight function φ. Each node represents a document, and each edge e ∈ E has a weight φ(e) ∈ [0,1] that models the similarity of its incident documents with respect to their referents. Using such a graph-based approach, the challenge can be subdivided into the following steps.
- Modeling. Design of a similarity function φ that reflects page similarity according to individuals (also called "referents").
- Merging. Choice of a clustering algorithm to generate different clusterings using various parameters.
- Evaluation. Test the validity of the generated clusterings in order to choose the best one.
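The weighted similarity graph G = ⟨V, E, φ⟩ defined above can be sketched in a few lines of Python. This is an illustrative skeleton only: the document identifiers and the toy similarity function stand in for the real page representations and the learned measure φ described below.

```python
from itertools import combinations

def build_similarity_graph(docs, phi):
    """Node set V = docs; each edge (u, v) carries the weight phi(u, v) in [0, 1]."""
    return {(u, v): phi(u, v) for u, v in combinations(docs, 2)}

# Toy stand-in for phi: identical (hypothetical) referent label -> similarity 1.0.
referent = {"page1": "A", "page2": "A", "page3": "B"}
edges = build_similarity_graph(
    sorted(referent),
    lambda u, v: 1.0 if referent[u] == referent[v] else 0.0,
)
```

In the real system the edge weights are, of course, not binary but probabilities produced by the learned classifier.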
In general, the modeling part constitutes the most critical and most important part of the analysis process. If the model is poor, no combination of clustering algorithm and validity measure can produce a high-quality clustering.
For training purposes, we assume the existence of a training set of Web pages that has been labeled according to referents. At training time, a similarity measure in the form of a classifier is learned from the training data; it is then employed online to derive similarities between previously unseen Web pages. The training procedure is as follows:
- (1a) For each document within an instance, several representations are computed. In particular, we compute a vector space representation using a TF*IDF weighting scheme, a core vocabulary representation, a semantic similarity model based on Wikipedia, a categorical model where the document is classified into the DMOZ hierarchy, and a high-level, domain-specific model that uses an NLP parser to extract specific knowledge.
- (2a) For each pair of documents, similarity values φi according to the mentioned document representations are computed.
- (3) A logistic regression is used to combine the similarity values φi of an edge e into a single value φ. The objective of the regression is to produce a value of φ(e)=1 if the incident documents of e belong to the same referent, and φ(e)=0 otherwise. In other words, the value of the logistic regression represents the probability that the documents incident to e belong to the same individual.
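Step (3) can be illustrated with a minimal, stdlib-only logistic regression trained by stochastic gradient descent. The feature rows and labels below are toy data: each row holds two per-model similarities (φ1, φ2) for a document pair, and label 1 means "same referent". The original system presumably uses a standard regression package; this sketch only shows how the φi are combined into a single probability φ.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights w and bias b so that sigmoid(w . x + b) approximates
    P(same referent | per-model similarities x). Plain SGD on the log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - target  # gradient of the log loss w.r.t. the logit z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def phi(x, w, b):
    """Combined edge weight: probability that the document pair shares a referent."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Toy training pairs: high per-model similarities -> same referent (label 1).
X = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
y = [1, 1, 0, 0]
w, b = train_logistic(X, y)
```

A pair whose per-model similarities are all high then receives an edge weight φ close to 1, a dissimilar pair a weight close to 0.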
The trained classifier is then applied as follows:
- (1b) For each pair of documents in an instance, the similarity values φi are computed.
- (2b) The trained classifier is used to combine the φi into one value φ, which is used as edge weight in a similarity graph G.
- (4) The similarity graph is reduced to a k-nearest-neighbor graph in order to reduce noise.
- (5) The similarity graph is clustered with the MajorClust algorithm (Stein and Niggemann 1999) using different density thresholds. The resulting clusterings are assessed with respect to their validity using the expected density measure ρ (Stein and Meyer zu Eissen 2003, Meyer zu Eissen 2007).
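Steps (4) and (5) can be sketched as follows. The k-nearest-neighbor reduction here keeps an edge if it is among the k strongest of either endpoint (one common variant), and the MajorClust loop is a simplified rendering of the algorithm's core idea: every node repeatedly adopts the label with maximal summed edge weight among its neighbors, with ties broken by iteration order. The validity assessment via the expected density ρ is omitted; graph and weights are illustrative.

```python
from collections import defaultdict

def knn_reduce(edges, k):
    """Keep only edges that rank among the k heaviest of either endpoint."""
    by_node = defaultdict(list)
    for (u, v), w in edges.items():
        by_node[u].append(((u, v), w))
        by_node[v].append(((u, v), w))
    keep = set()
    for incident in by_node.values():
        incident.sort(key=lambda t: -t[1])
        keep.update(e for e, _ in incident[:k])
    return {e: w for e, w in edges.items() if e in keep}

def majorclust(nodes, edges, max_iter=100):
    """Simplified MajorClust: propagate labels until no node changes."""
    label = {v: i for i, v in enumerate(nodes)}  # every node starts alone
    adj = defaultdict(dict)
    for (u, v), w in edges.items():
        adj[u][v] = w
        adj[v][u] = w
    for _ in range(max_iter):
        changed = False
        for v in nodes:
            score = defaultdict(float)
            for u, w in adj[v].items():
                score[label[u]] += w  # attraction of each neighboring cluster
            if score:
                best = max(score, key=score.get)
                if best != label[v]:
                    label[v] = best
                    changed = True
        if not changed:
            break
    return label

# Two tightly connected triples joined by one weak edge.
nodes = ["a", "b", "c", "x", "y", "z"]
edges = {("a", "b"): 0.9, ("a", "c"): 0.9, ("b", "c"): 0.9,
         ("x", "y"): 0.9, ("x", "z"): 0.9, ("y", "z"): 0.9,
         ("a", "x"): 0.1}
reduced = knn_reduce(edges, 2)   # drops the weak bridge ("a", "x")
labels = majorclust(nodes, reduced)
```

On this toy graph the reduction removes the weak cross-edge, and the label propagation settles on two clusters, {a, b, c} and {x, y, z}.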
Students: Steffen Becker, Christof Bräutigam, Tino Rüb, and Hagen-Christian Tönnies