de.aitools.ir.retrievalmodels.representer
Class DivergenceFromRandomness

java.lang.Object
  extended by de.aitools.ir.retrievalmodels.representer.AbstractRepresenter<java.lang.String,Vector>
      extended by de.aitools.ir.retrievalmodels.representer.DivergenceFromRandomness
All Implemented Interfaces:
Representer<java.lang.String,Vector>, java.io.Serializable

public class DivergenceFromRandomness
extends AbstractRepresenter<java.lang.String,Vector>

Divergence From Randomness (DfR) is a retrieval model developed by Amati et al. It is based on Harter's assumption that the significance of a term can be inferred from its distribution in the collection. Insignificant terms, like stop words, are distributed randomly over the whole document collection; they follow a Poisson distribution. By contrast, informative terms deviate from the Poisson hypothesis and concentrate in a small subset of the document collection --- the elite set. Within the elite set, the terms are assumed to be Poisson distributed again. This probabilistic model is called the 2-Poisson model.
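As a sketch of Harter's 2-Poisson model (stated here for orientation; the parametrization is not part of this class's API), the frequency k of a term is modeled as a mixture of two Poisson distributions, one for the elite set and one for the remaining documents:

 P(tf = k) = \alpha \frac{\lambda_E^{k} e^{-\lambda_E}}{k!} + (1 - \alpha) \frac{\lambda_{\bar{E}}^{k} e^{-\lambda_{\bar{E}}}}{k!}

where \alpha is the probability that a document belongs to the elite set, and \lambda_E, \lambda_{\bar{E}} are the expected term frequencies inside and outside the elite set.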

Amati et al. weight terms in the DfR model by two probability distributions. The first probability states that words carrying little information are randomly distributed over the whole set of documents. Consequently, the lower this probability, the higher the information gain. To describe the notion of randomness, they provide seven models --- including a Poisson model. The second probability in the scheme represents the risk of choosing a term as a good descriptor for a document. The higher the risk, the higher the gain in information if the assumption turns out to be wrong. The risk can be modeled either by Laplace's "law of succession" or by a Bernoulli experiment.
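In the reference below, these two components are combined multiplicatively. A sketch of the general DfR weighting scheme (the symbols follow the paper, not the fields of this class):

 w(t, d) = (1 - \mathrm{Prob}_2) \cdot (-\log_2 \mathrm{Prob}_1)

Here \mathrm{Prob}_1 is the probability of observing the term's frequency under the chosen randomness model (the basic model), and \mathrm{Prob}_2 is the probability of one more occurrence of the term given that it already occurs in the document (the risk, i.e. the first normalization).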

Term frequencies are normalized by document lengths. The first hypothesis (H1) assumes that all terms within a document are uniformly distributed. The second (H2) assumes that terms in short documents are denser than in long documents. In experiments, the second hypothesis was favored.
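In the DfR literature the two hypotheses are usually stated as follows (a sketch; the exact constants used by this implementation are not documented on this page). With tf the raw term frequency, dl the document length, and avg_dl the average document length of the collection:

 \text{H1: } \mathit{tfn} = \mathit{tf} \cdot \frac{\mathit{avg\_dl}}{\mathit{dl}}
 \qquad
 \text{H2: } \mathit{tfn} = \mathit{tf} \cdot \log_2\!\left(1 + \frac{\mathit{avg\_dl}}{\mathit{dl}}\right)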

A DfR model is described by a string XYZ, where X denotes the basic model, Y the first normalization factor (the risk), and Z the second normalization factor (H1 or H2). The combination that performed best in experiments is "In(e)B2". You can choose between the following types (a construction sketch follows the lists below):

Basic models (to model the distribution of terms):

 (1) Bose Einstein Statistics [BE ]
 (2) Divergence Model         [D  ]
 (3) Geometric Model          [G  ]
 (4) INQUERY System F         [F  ]
 (5) Tf Model                 [In ]
 (6) Tf Expected-Idf Model    [Ine]
 (7) Poisson Model            [P  ]
 

First normalization (so-called after-effect models):

 (1) Laplace Normalization    [L]
 (2) Bernoulli Normalization  [B]
 

Second normalization (for term frequencies):

 (1) Hypothesis 1             [H1]
 (2) Hypothesis 2             [H2]
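The enum constant names used below are assumptions for illustration only --- the actual constants of DivergenceFromRandomness.BasicModel, N1, and N2 are not listed on this page. A minimal sketch of selecting the recommended "In(e)B2" combination via the three-argument constructor:

 import java.util.Locale;

 import de.aitools.ir.retrievalmodels.representer.DivergenceFromRandomness;

 // Hypothetical enum constant names -- consult the nested classes for the real ones.
 DivergenceFromRandomness dfr = new DivergenceFromRandomness(
     Locale.ENGLISH,
     DivergenceFromRandomness.BasicModel.Ine,  // Tf Expected-Idf model, In(e)
     DivergenceFromRandomness.N1.B,            // Bernoulli first normalization
     DivergenceFromRandomness.N2.H2);          // H2 term frequency normalization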
 

Reference: Probabilistic models of information retrieval based on measuring the divergence from randomness, by Amati et al., 2002.

Version:
aitools 3.0 Created on Mar 28, 2010 $Id: DivergenceFromRandomness.java,v 1.1 2012/04/23 13:44:42 hoppe Exp $
Author:
dennis.hoppe@uni-weimar.de
See Also:
Serialized Form

Nested Class Summary
static class DivergenceFromRandomness.BasicModel
           
static class DivergenceFromRandomness.N1
           
static class DivergenceFromRandomness.N2
           
 
Constructor Summary
DivergenceFromRandomness(DivergenceFromRandomness.BasicModel x, DivergenceFromRandomness.N1 y, DivergenceFromRandomness.N2 z, TermFrequency tf)
           
DivergenceFromRandomness(java.util.Locale l)
          This constructor initializes the Divergence from Randomness model by default with 'In(e)B2'.
DivergenceFromRandomness(java.util.Locale l, DivergenceFromRandomness.BasicModel x, DivergenceFromRandomness.N1 y, DivergenceFromRandomness.N2 z)
          Provides a DfR model with your preferred combination of a basic model x, a risk factor y, and a saturation function z to normalize term frequencies.
 
Method Summary
 boolean isTrained()
           
 Vector represent(java.lang.String text)
           
 void train(java.lang.Iterable<java.lang.String> texts, boolean forceTraining)
           
 
Methods inherited from class de.aitools.ir.retrievalmodels.representer.AbstractRepresenter
train
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

DivergenceFromRandomness

public DivergenceFromRandomness(java.util.Locale l)
This constructor initializes the Divergence from Randomness model by default with 'In(e)B2'. This combination performed best in the experiments carried out by the authors of DfR.

Parameters:
l - used by the TermFrequency
See Also:
DivergenceFromRandomness(Locale, BasicModel, N1, N2), DivergenceFromRandomness(BasicModel, N1, N2, TermFrequency)
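A minimal usage sketch of this convenience constructor (Locale.ENGLISH is only an example value):

 DivergenceFromRandomness dfr = new DivergenceFromRandomness(java.util.Locale.ENGLISH);
 // The model is now configured with the default combination In(e)B2.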

DivergenceFromRandomness

public DivergenceFromRandomness(java.util.Locale l,
                                DivergenceFromRandomness.BasicModel x,
                                DivergenceFromRandomness.N1 y,
                                DivergenceFromRandomness.N2 z)
Provides a DfR model with your preferred combination of a basic model x, a risk factor y, and a saturation function z to normalize term frequencies.

Parameters:
l - used by the TermFrequency
x - one of the basic models defined by DivergenceFromRandomness.BasicModel
y - one of the risk functions defined by DivergenceFromRandomness.N1
z - one of the term frequency normalizations defined by DivergenceFromRandomness.N2
See Also:
DivergenceFromRandomness(BasicModel, N1, N2, TermFrequency)

DivergenceFromRandomness

public DivergenceFromRandomness(DivergenceFromRandomness.BasicModel x,
                                DivergenceFromRandomness.N1 y,
                                DivergenceFromRandomness.N2 z,
                                TermFrequency tf)
Parameters:
x - one of the basic models defined by DivergenceFromRandomness.BasicModel
y - one of the risk functions defined by DivergenceFromRandomness.N1
z - one of the term frequency normalizations defined by DivergenceFromRandomness.N2
tf - used to represent documents
Method Detail

represent

public Vector represent(java.lang.String text)

isTrained

public boolean isTrained()

train

public void train(java.lang.Iterable<java.lang.String> texts,
                  boolean forceTraining)
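
A minimal end-to-end sketch, assuming the representer is trained on the document collection before documents are represented (the exact semantics of forceTraining are not documented on this page):

 java.util.List<java.lang.String> documents = java.util.Arrays.asList(
     "first document text", "second document text");

 DivergenceFromRandomness dfr = new DivergenceFromRandomness(java.util.Locale.ENGLISH);
 dfr.train(documents, true);      // gather collection statistics
 if (dfr.isTrained()) {
   Vector vector = dfr.represent("text of a query or a document");
 }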