de.aitools.ir.retrievalmodels.representer
Class OkapiBM25

java.lang.Object
  extended by de.aitools.ir.retrievalmodels.representer.AbstractRepresenter<java.lang.String,Vector>
      extended by de.aitools.ir.retrievalmodels.representer.OkapiBM25
All Implemented Interfaces:
Representer<java.lang.String,Vector>, java.io.Serializable

public class OkapiBM25
extends AbstractRepresenter<java.lang.String,Vector>

Okapi BM25 is a retrieval model developed by Robertson et al. in the early 1990s. We use it here as a term weighting scheme. BM25 interpolates between BM11 and BM15; the latter applies no document length normalization. Robertson argues that normalization is essential because authors differ in verbosity. BM25 can be seen as a tf-idf model in which the term frequency component is a non-linear saturation function. BM25 can also be viewed as a "Divergence from Randomness" (DfR) model: the idf component corresponds to the DfR over the whole collection, and the tf component is the DfR contribution of the document itself.
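
The tf-idf reading described above can be illustrated with a minimal sketch. It is not the aitools implementation: the class name Bm25WeightSketch, its method names and the choice of the Robertson/Sparck Jones idf variant are assumptions made for illustration only.

```java
// Illustrative sketch only; NOT the aitools implementation.
// Class name, method names and the idf variant are assumptions.
final class Bm25WeightSketch {

    /**
     * Robertson/Sparck Jones style idf (assumed variant):
     * log((N - df + 0.5) / (df + 0.5)).
     * Note that this becomes negative for terms occurring in more than half of the documents.
     */
    static double idf(long numDocs, long df) {
        return Math.log((numDocs - df + 0.5) / (df + 0.5));
    }

    /** Document-side tf component; grows with tf but stays below (k1 + 1). */
    static double tfComponent(double tf, double docLen, double avgDocLen, double k1, double b) {
        double norm = k1 * ((1.0 - b) + b * docLen / avgDocLen);
        return (k1 + 1.0) * tf / (norm + tf);
    }

    /** Term weight as the product of the idf and the saturating tf component. */
    static double weight(double tf, long df, long numDocs,
                         double docLen, double avgDocLen, double k1, double b) {
        return idf(numDocs, df) * tfComponent(tf, docLen, avgDocLen, k1, b);
    }

    public static void main(String[] args) {
        double k1 = 1.2;
        double b = 0.75;
        // The weight increases with tf, but by ever smaller steps (saturation).
        for (int tf : new int[] {1, 2, 5, 20, 100}) {
            System.out.printf("tf=%d weight=%.4f%n", tf, weight(tf, 10, 1000, 100, 100, k1, b));
        }
    }
}
```

The sketch covers only the document side of the formula; query term weighting with k3 is left out.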

The weighting scheme must be trained before use in order to compute the average document length and the df (document frequency) vector. BM25 depends on several parameters, which are explained for each constructor. By default, this implementation uses b = 0.75, k1 = 1.2 and k3 = 8. k2 is always zero, so no global correction of the document length is performed for queries.
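
What such a training pass has to collect can be sketched as follows. This is a hypothetical stand-alone illustration, not the code behind train(Iterable, boolean): the class Bm25TrainingSketch, its fields and the lower-cased whitespace tokenization are assumptions, whereas the real class delegates tokenization to TermFrequency.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; NOT the aitools implementation.
// Tokenization by whitespace is an assumption made to keep the example small.
final class Bm25TrainingSketch {
    /** Number of documents in which each term occurs at least once. */
    final Map<String, Integer> df = new HashMap<>();
    /** Average document length in tokens. */
    double avgDocLen;
    /** Number of documents seen during training. */
    int numDocs;

    void train(Iterable<String> texts) {
        // Start from scratch so that repeated calls do not mix statistics.
        df.clear();
        numDocs = 0;
        long totalLen = 0;
        for (String text : texts) {
            String trimmed = text.trim().toLowerCase(Locale.ROOT);
            String[] tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
            totalLen += tokens.length;
            numDocs++;
            // Count each term once per document, regardless of its frequency.
            Set<String> distinct = new HashSet<>(Arrays.asList(tokens));
            for (String term : distinct) {
                df.merge(term, 1, Integer::sum);
            }
        }
        avgDocLen = numDocs == 0 ? 0.0 : (double) totalLen / numDocs;
    }

    public static void main(String[] args) {
        Bm25TrainingSketch sketch = new Bm25TrainingSketch();
        sketch.train(Arrays.asList("a b a", "b c", "d"));
        System.out.println("avgDocLen=" + sketch.avgDocLen + " df=" + sketch.df);
    }
}
```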

 Reference: 
 
 Okapi at TREC-4, by Robertson et al., 1995.
 

Version:
aitools 3.0 Created on Mar 28, 2010 $Id: OkapiBM25.java,v 1.1 2010/05/19 15:52:03 poma1006 Exp $
Author:
[email protected]
See Also:
Serialized Form

Nested Class Summary
 class OkapiBM25.DocumentState
          Default state to represent text.
 class OkapiBM25.QueryState
          State to represent queries.
static interface OkapiBM25.RepresentationState
           
 
Field Summary
static double B
           
static double K1
           
static double K3
           
 
Constructor Summary
OkapiBM25(double k1, double k3, double b, TermFrequency tf)
           
OkapiBM25(java.util.Locale l)
          Initializes the BM25 formula with the default parameters k1 = 1.2, k3 = 8 and b = 0.75.
OkapiBM25(java.util.Locale l, double b)
           
OkapiBM25(java.util.Locale l, double k1, double k3, double b)
           
 
Method Summary
 double getB()
           
 double getK1()
           
 double getK3()
           
 boolean isTrained()
           
 Vector represent(java.lang.String text)
           
 void setB(double b)
           
 void setK1(double k1)
           
 void setK3(double k3)
           
 void setState(OkapiBM25.RepresentationState state)
          Changes the way of representing text.
 void train(java.lang.Iterable<java.lang.String> texts, boolean forceTraining)
           
 
Methods inherited from class de.aitools.ir.retrievalmodels.representer.AbstractRepresenter
train
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

B

public static final double B
See Also:
Constant Field Values

K1

public static final double K1
See Also:
Constant Field Values

K3

public static final double K3
See Also:
Constant Field Values
Constructor Detail

OkapiBM25

public OkapiBM25(java.util.Locale l)
Initializes the BM25 formula with the default parameters k1 = 1.2, k3 = 8 and b = 0.75.

Parameters:
l - the locale to be used by TermFrequency
See Also:
OkapiBM25(Locale, double), OkapiBM25(Locale, double, double, double), OkapiBM25(double, double, double, TermFrequency)

OkapiBM25

public OkapiBM25(java.util.Locale l,
                 double b)
Parameters:
l - the locale to be used by TermFrequency
b - if b is 0, the best-match formula reduces to BM15, without any length normalization. Setting b to 1 yields BM11, with full length normalization. Any value between 0 and 1 is allowed and results in a soft normalization between BM15 and BM11.
See Also:
OkapiBM25(Locale, double, double, double), OkapiBM25(double, double, double, TermFrequency)
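
The effect of b can be illustrated with the length normalization factor from the Okapi formulation, (1 - b) + b * dl / avdl. The following is a minimal sketch with hypothetical names (LengthNormalizationSketch, lengthFactor); how this class applies the factor internally is not shown on this page.

```java
// Illustrative sketch only; NOT the aitools implementation.
final class LengthNormalizationSketch {

    /**
     * Length factor (1 - b) + b * docLen / avgDocLen.
     * b = 0 ignores the document length (BM15); b = 1 uses the full ratio (BM11).
     */
    static double lengthFactor(double b, double docLen, double avgDocLen) {
        return (1.0 - b) + b * docLen / avgDocLen;
    }

    public static void main(String[] args) {
        double docLen = 200.0;    // a document twice as long as the average
        double avgDocLen = 100.0;
        for (double b : new double[] {0.0, 0.75, 1.0}) {
            System.out.println("b=" + b + " factor=" + lengthFactor(b, docLen, avgDocLen));
        }
    }
}
```

For a document twice the average length, the factor is 1 with b = 0, 1.75 with b = 0.75 and 2 with b = 1; a larger factor means that the same raw term frequency counts for less.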

OkapiBM25

public OkapiBM25(java.util.Locale l,
                 double k1,
                 double k3,
                 double b)
Parameters:
l - the locale to be used by TermFrequency
k1 - controls the shape of the saturation function tf/(k1+tf). k1 must be greater than 0.
b - if b is 0, the best-match formula reduces to BM15, without any length normalization. Setting b to 1 yields BM11, with full length normalization. Any value between 0 and 1 is allowed and results in a soft normalization between BM15 and BM11.
See Also:
OkapiBM25(double, double, double, TermFrequency)
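
The saturation function tf/(k1+tf) mentioned for k1 can be explored with a small sketch. The class SaturationSketch is hypothetical, and rejecting k1 <= 0 with an exception is an assumption of the sketch; this page does not state how the real class reacts to an invalid k1.

```java
// Illustrative sketch only; NOT the aitools implementation.
final class SaturationSketch {

    /**
     * Saturation function tf / (k1 + tf): 0 for tf = 0, exactly 0.5 for tf = k1,
     * and approaching (but never reaching) 1 as tf grows.
     */
    static double saturation(double tf, double k1) {
        if (k1 <= 0) {
            // Assumption of this sketch: invalid k1 is rejected eagerly.
            throw new IllegalArgumentException("k1 must be greater than 0");
        }
        return tf / (k1 + tf);
    }

    public static void main(String[] args) {
        // A smaller k1 saturates faster; a larger k1 keeps the curve closer to linear for small tf.
        for (double k1 : new double[] {0.5, 1.2, 5.0}) {
            System.out.println("k1=" + k1
                    + " tf=1: " + saturation(1, k1)
                    + " tf=10: " + saturation(10, k1));
        }
    }
}
```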

OkapiBM25

public OkapiBM25(double k1,
                 double k3,
                 double b,
                 TermFrequency tf)
Parameters:
tf - an instance of TermFrequency
k1 - controls the shape of the saturation function tf/(k1+tf). k1 must be greater than 0.
b - if b is 0, the best-match formula reduces to BM15, without any length normalization. Setting b to 1 yields BM11, with full length normalization. Any value between 0 and 1 is allowed and results in a soft normalization between BM15 and BM11.
Method Detail

setState

public void setState(OkapiBM25.RepresentationState state)
Changes the way of representing text.

Parameters:
state - a new state which defines how the text is represented

represent

public Vector represent(java.lang.String text)

train

public void train(java.lang.Iterable<java.lang.String> texts,
                  boolean forceTraining)

isTrained

public boolean isTrained()

getK1

public final double getK1()

setK1

public final void setK1(double k1)

getK3

public final double getK3()

setK3

public final void setK3(double k3)

getB

public final double getB()

setB

public final void setB(double b)