java.lang.Object
  de.aitools.ir.retrievalmodels.representer.AbstractRepresenter<java.lang.String,Vector>
      de.aitools.ir.retrievalmodels.representer.DivergenceFromRandomness
public class DivergenceFromRandomness extends AbstractRepresenter<java.lang.String,Vector>
Divergence From Randomness (DfR) is a retrieval model developed by Amati et al. It builds on Harter's assumption that the significance of a term can be inferred from its distribution in the collection. Insignificant terms, like stop words, are distributed randomly over the whole document collection; they follow a Poisson distribution. Informative terms, by contrast, occur, against the hypothesis of a Poisson distribution, concentrated in a small subset of the document collection, the elite set. The terms within the elite set are assumed to again be Poisson distributed. This probabilistic model is called the 2-Poisson model.
In the DfR model, Amati et al. weight terms by two probability distributions. The first probability states that words carrying little information are randomly distributed over the whole set of documents; consequently, the lower this probability, the higher the information gain. To capture the notion of randomness, they provide seven models, including a Poisson model. The second probability represents the risk of choosing a term as a good descriptor for a document: the higher the risk, the higher the gain in information if the assumption turns out to be wrong. The risk can be modeled either by Laplace's "law of succession" or by a Bernoulli experiment.
Term frequencies are normalized by document length. The first hypothesis (H1) assumes that all terms within a document are uniformly distributed; the second (H2) assumes that terms in short documents are denser than in long documents. In experiments, the second hypothesis was favored.
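The two-probability scheme and the H2 normalization described above can be summarized in one formula; the notation below is assumed from Amati et al. (2002) and is a sketch, not the exact expression this class computes:

```latex
% DfR weight of term t in document d:
%   Inf_1 = information content under the randomness model (first probability),
%   Inf_2 = risk-based first normalization (second probability).
w(t,d) = \mathrm{Inf}_1 \cdot \mathrm{Inf}_2
       = \bigl(-\log_2 \mathrm{Prob}_1(t \mid \mathcal{C})\bigr)
         \cdot \bigl(1 - \mathrm{Prob}_2(t \mid E_t)\bigr)

% Second normalization H2 (c is a free parameter, l the document length,
% avg_l the average document length in the collection):
\mathit{tfn} = \mathit{tf} \cdot \log_2\!\left(1 + c \cdot \frac{\mathit{avg\_l}}{l}\right)

% First normalization via Laplace's law of succession:
1 - \mathrm{Prob}_2 = \frac{1}{\mathit{tfn} + 1}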
A DfR model is described by a sequence of strings XYZ, where X represents the basic model, Y the first normalization factor (the risk), and Z the second normalization factor (H1 or H2). The combination of DfR models that performed best in experiments was "In(e)B2". You can choose between the following types:
Basic models (to model the distribution of terms):
(1) Bose-Einstein Statistics [BE]
(2) Divergence Model [D]
(3) Geometric Model [G]
(4) INQUERY System F [F]
(5) Tf Model [In]
(6) Tf Expected-Idf Model [Ine]
(7) Poisson Model [P]
First normalization (so-called after-effect models):
(1) Laplace Normalization [L]
(2) Bernoulli Normalization [B]
Second normalization (for term frequencies):
(1) Hypothesis 1 [H1]
(2) Hypothesis 2 [H2]
Reference: Probabilistic models of information retrieval based on measuring the divergence from randomness, by Amati et al., 2002.
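As a concrete illustration of the weighting scheme described above, here is a self-contained sketch of one DfR combination, "PL2" (Poisson basic model P, Laplace normalization L, hypothesis H2). The class name, the parameter choices, and the use of Stirling's approximation for the Poisson term are illustrative assumptions; this is not the implementation behind this class, which defaults to "In(e)B2".

```java
/**
 * Illustrative sketch of the DfR term weight for the "PL2" combination.
 * All names are hypothetical; not the de.aitools implementation.
 */
public class DfrSketch {

    /** Logarithm to base 2. */
    static double log2(double x) { return Math.log(x) / Math.log(2.0); }

    /**
     * @param tf raw frequency of the term in the document
     * @param docLength document length in tokens
     * @param avgDocLength average document length in the collection
     * @param termFreqInCollection total occurrences of the term in the collection
     * @param numDocs number of documents in the collection
     * @param c free parameter of the H2 normalization (often 1.0)
     */
    static double weightPL2(double tf, double docLength, double avgDocLength,
                            double termFreqInCollection, double numDocs, double c) {
        // Second normalization (H2): terms in short documents are denser,
        // so scale tf by the document-length ratio.
        double tfn = tf * log2(1.0 + c * avgDocLength / docLength);
        // Poisson basic model: -log2 of the probability of observing tfn
        // occurrences under a random distribution with mean lambda,
        // using Stirling's approximation of log2(tfn!).
        double lambda = termFreqInCollection / numDocs;
        double inf1 = tfn * log2(tfn / lambda)
                + (lambda - tfn) * log2(Math.E)
                + 0.5 * log2(2.0 * Math.PI * tfn);
        // First normalization (Laplace): Prob2 = tfn / (tfn + 1),
        // so the remaining information gain is 1 / (tfn + 1).
        double inf2 = 1.0 / (tfn + 1.0);
        return inf2 * inf1;
    }

    public static void main(String[] args) {
        // A term occurring 3 times in a 100-token document; collection
        // statistics are made up for the example.
        System.out.println(weightPL2(3, 100, 120, 50, 1000, 1.0));
    }
}
```

Note how the Laplace factor 1/(tfn + 1) saturates the score: additional occurrences of a term keep increasing the weight, but with diminishing returns.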
Nested Class Summary

static class  DivergenceFromRandomness.BasicModel
static class  DivergenceFromRandomness.N1
static class  DivergenceFromRandomness.N2
Constructor Summary

DivergenceFromRandomness(DivergenceFromRandomness.BasicModel x,
                         DivergenceFromRandomness.N1 y,
                         DivergenceFromRandomness.N2 z,
                         TermFrequency tf)

DivergenceFromRandomness(java.util.Locale l)
          This constructor initializes the Divergence from Randomness model per default with 'In(e)B2'.

DivergenceFromRandomness(java.util.Locale l,
                         DivergenceFromRandomness.BasicModel x,
                         DivergenceFromRandomness.N1 y,
                         DivergenceFromRandomness.N2 z)
          Provides a DfR model with your preferred combination of a basic model x, a risk factor y, and a saturation function z to normalize term frequencies.
Method Summary

 boolean  isTrained()

 Vector   represent(java.lang.String text)

 void     train(java.lang.Iterable<java.lang.String> texts, boolean forceTraining)
Methods inherited from class de.aitools.ir.retrievalmodels.representer.AbstractRepresenter
train
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Constructor Detail

public DivergenceFromRandomness(java.util.Locale l)
    This constructor initializes the Divergence from Randomness model per default with 'In(e)B2'.
    Parameters:
        l - used by the TermFrequency
    See Also:
        DivergenceFromRandomness(Locale, BasicModel, N1, N2),
        DivergenceFromRandomness(BasicModel, N1, N2, TermFrequency)

public DivergenceFromRandomness(java.util.Locale l, DivergenceFromRandomness.BasicModel x, DivergenceFromRandomness.N1 y, DivergenceFromRandomness.N2 z)
    Provides a DfR model with your preferred combination of a basic model x, a risk factor y, and a saturation function z to normalize term frequencies.
    Parameters:
        l - used by the TermFrequency
        x - one of the basic models defined by BasicModel
        y - one of the risk functions defined by N1
        z - one of the term frequency normalizations defined by N2
    See Also:
        DivergenceFromRandomness(BasicModel, N1, N2, TermFrequency)
public DivergenceFromRandomness(DivergenceFromRandomness.BasicModel x, DivergenceFromRandomness.N1 y, DivergenceFromRandomness.N2 z, TermFrequency tf)
    Parameters:
        x - one of the basic models defined by BasicModel
        y - one of the risk functions defined by N1
        z - one of the term frequency normalizations defined by N2
        tf - used to represent documents

Method Detail
public Vector represent(java.lang.String text)
public boolean isTrained()
public void train(java.lang.Iterable<java.lang.String> texts, boolean forceTraining)