de.aitools.dm.clustering.algorithms
Class KNNHAC

java.lang.Object
  extended by de.aitools.dm.clustering.algorithms.ASoftClusterer
      extended by de.aitools.dm.clustering.algorithms.AClusterer
          extended by de.aitools.dm.clustering.algorithms.KNNHAC
All Implemented Interfaces:
Clusterer, SoftClusterer

public final class KNNHAC
extends AClusterer

Class for clustering Vectors using the Hierarchical Agglomerative Clustering algorithm (EfficientHAC) described in Introduction to Information Retrieval
. In Agglomerative Hierarchical Clustering at the start of the algorithm all data points are a cluster themselves. In each iteration two clusters are merged until only the desired number of clusters are left. The two clusters are chosen, which Proximity is highest. However, there are different methods to define the proximity between two clusters. Implemented at the moment are:

Note that Centroid, Median and Ward's Method need a Distance measure as Proximity. For this implementation all measures have to be symmetric.
The implementation uses Lance-Williams coefficients for the different clustering methods as in Multivariate Analysemethoden.
As stated by the formula, the proximity between any cluster Q and a newly merged cluster R (merged from cluster A and B) can be computed by a linear function taking as input the similarities between Q, A and B as well as the number of data points in those clusters. Note that there are other methods of Agglomerative Hierarchical Clustering that can not be expressed in this way, but all of the methods implemented (and described above) are.
This algorithm runs in O((n - k) n log n), with n being the number of data points and k being the number of desired clusters.
References:
Backhaus Klaus, Erichson Bernd, Plinke Wulff, Weiber Rolf (2006).
Multivariate Analysemethoden.
Springer, Berlin, Heidelberg, New York.

BibTeX:
 @book{BackhausErichsonPlinkeWeiber2006,
   author = {Backhaus Klaus and Erichson Bernd and Plinke Wulff and Weiber Rolf},
   publisher = {Springer},
   title = {Multivariate Analysemethoden},
   year = {2006}
 }
 

Manning Christopher D., Raghavan Prabhakar, Schütze Hinrich (2008).
Introduction to Information Retrieval.
Cambridge University Press, New York, NY.

BibTeX:
 @book{ManningRaghavanHinrich2008,
   address = {New York, NY},
   author = {Manning Christopher D. and Raghavan Prabhakar and Schütze Hinrich},
   publisher = {Cambridge University Press},
   title = {Introduction to Information Retrieval},
   year = {2008}
 }
 

Tan Pang-Ning, Steinbach Michael, Kumar Vipin (2006).
Introduction to Data Mining.
Pearson Education, Boston, MA.

BibTeX:
 @book{TanSteinbachKumar2006,
   address = {Boston, MA},
   author = {Tan Pang-Ning and Steinbach Michael and Kumar Vipin},
   publisher = {Pearson Education},
   title = {Introduction to Data Mining},
   year = {2006}
 }
 

G. N. Lance and W. T. Williams (1967).
A general theory of classificatory sorting strategies. 1. Hierarchical Systems
Computer Journal, Vol. 9, p. 373.

BibTeX:
 @article{LanceWilliams1967,
   author = {Lance G. N. and Williams W. T.},
   publisher = {Computer Journal},
   title = {A general theory of classificatory sorting strategies. 1. Hierarchical Systems},
   year = {1967}
 }
 

Version:
$Id: KNNHAC.java,v 1.6 2011/11/09 13:40:46 hoppe Exp $
Author:
johannes.kiesel(/\t)uni-weimar.de

Field Summary
 
Fields inherited from interface de.aitools.dm.clustering.Clusterer
DEFAULT_SEED
 
Constructor Summary
KNNHAC(Configuration configuration)
          Create a new K-Nearest-Neighbor-Hierarchical-Agglomerative-Clusterer (KNNHAC).
KNNHAC(HACClusterMethod clusterMethod, Proximity<Vector> proximityMeasure)
          Create a new K-Nearest-Neighbor-Hierarchical-Agglomerative-Clusterer (KNNHAC) using the default value for the number of neighbors (see setNumberOfNeighbors(int)).
KNNHAC(HACClusterMethod clusterMethod, Proximity<Vector> proximityMeasure, int numNeighbors)
          Create a new K-Nearest-Neighbor-Hierarchical-Agglomerative-Clusterer (KNNHAC).
 
Method Summary
 int[] cluster(Vector[] data)
          This method is used for clustering via the TIRA Framework.
 int[] cluster(Vector[] data, double threshold)
          Cluster given data hierarchically until the proximities between all clusters is less or equal to threshold.
 int[] cluster(Vector[] data, int numClusters)
          Cluster given data hierarchically until only numClusters are left.
 Dendrogram<DoubleMerge> clusterDendrogram(Vector[] data)
          Cluster given data hierarchically.
 HACClusterMethod getClusterMethod()
           
 int getNumberOfNeighbors()
           
 Proximity<Vector> getProximityMeasure()
           
static void main(java.lang.String[] args)
           
 void setClusterMethod(HACClusterMethod clusterMethod)
          This is a general implementation of a hierarchical agglomerative clustering algorithm (HAC).
 void setNumberOfClusters(int numClusters)
          Made for integration into the TIRA Framework.
 void setNumberOfNeighbors(int numNeighbors)
          This sets the number of neighbors for the K-Nearest-Neighbor-Graph that is used in this algorithm.
 void setProximityMeasure(Proximity<Vector> proximityMeasure)
          Sets the proximity measure to be used for the clustering steps.
 
Methods inherited from class de.aitools.dm.clustering.algorithms.AClusterer
cluster, cluster, cluster, clusterSoft
 
Methods inherited from class de.aitools.dm.clustering.algorithms.ASoftClusterer
clusterSoft, clusterSoft, clusterSoft, getBiggestRange
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

KNNHAC

public KNNHAC(HACClusterMethod clusterMethod,
              Proximity<Vector> proximityMeasure)
Create a new K-Nearest-Neighbor-Hierarchical-Agglomerative-Clusterer (KNNHAC) using the default value for the number of neighbors (see setNumberOfNeighbors(int)).

Parameters:
clusterMethod - As in setClusterMethod(HACClusterMethod).
proximityMeasure - As in setProximityMeasure(Proximity).
See Also:
KNNHAC(HACClusterMethod, Proximity, int), KNNHAC(Configuration)

KNNHAC

public KNNHAC(HACClusterMethod clusterMethod,
              Proximity<Vector> proximityMeasure,
              int numNeighbors)
Create a new K-Nearest-Neighbor-Hierarchical-Agglomerative-Clusterer (KNNHAC).

Parameters:
clusterMethod - As in setClusterMethod(HACClusterMethod).
proximityMeasure - As in setProximityMeasure(Proximity).
numNeighbors - As in setNumberOfNeighbors(int).
See Also:
KNNHAC(HACClusterMethod, Proximity), KNNHAC(Configuration)

KNNHAC

public KNNHAC(Configuration configuration)
Create a new K-Nearest-Neighbor-Hierarchical-Agglomerative-Clusterer (KNNHAC).

Parameters:
configuration - Object for configuring this clusterer:
See Also:
KNNHAC(HACClusterMethod, Proximity), KNNHAC(HACClusterMethod, Proximity, int)
Method Detail

setClusterMethod

public void setClusterMethod(HACClusterMethod clusterMethod)
This is a general implementation of a hierarchical agglomerative clustering algorithm (HAC). The concrete method it uses can be changed using this method.

Parameters:
clusterMethod - The cluster method to use.

setNumberOfNeighbors

public void setNumberOfNeighbors(int numNeighbors)
This sets the number of neighbors for the K-Nearest-Neighbor-Graph that is used in this algorithm. The normal general HAC-Algorithm needs O(N2) space. This can be used to lower this requirement.
See KNNGraph.createUndirectedKNNIntGraph( Vector[], Proximity, int, double) for an explanation.

Parameters:
numNeighbors - Number of neighbors for the graph as shown above.

setProximityMeasure

public void setProximityMeasure(Proximity<Vector> proximityMeasure)
Sets the proximity measure to be used for the clustering steps.

Parameters:
proximityMeasure - The measure to use.

setNumberOfClusters

public void setNumberOfClusters(int numClusters)
Made for integration into the TIRA Framework. As it uses cluster(Vector[]) through which the number of clusters can not further be specified, this method (or KNNHAC(Configuration)) can be used to tell the algorithm how many clusters to generate.

Parameters:
numClusters - Number of clusters to generate. Must be greater than zero.

getClusterMethod

public HACClusterMethod getClusterMethod()
Returns:
The cluster method to use.
See Also:
setClusterMethod(HACClusterMethod)

getProximityMeasure

public Proximity<Vector> getProximityMeasure()
Returns:
The proximity measure to use.
See Also:
setProximityMeasure(Proximity)

getNumberOfNeighbors

public int getNumberOfNeighbors()
Returns:
Number of neighbors in the k-nearest-neighbor-graph that will be created during clustering. See setNumberOfNeighbors(int) for more information.

cluster

public int[] cluster(Vector[] data,
                     int numClusters)
Cluster given data hierarchically until only numClusters are left.

Parameters:
data - The vectors to cluster.
numClusters - Number of clusters to generate.
Returns:
Cluster assignment c such that
c[i] = cluster-id of the cluster to which data[i] was assigned.
See Also:
cluster(Vector[], double), clusterDendrogram(Vector[])

cluster

public int[] cluster(Vector[] data,
                     double threshold)
Cluster given data hierarchically until the proximities between all clusters is less or equal to threshold.

Parameters:
data - The vectors to cluster.
threshold - The threshold for clustering.
Returns:
Cluster assignment c such that
c[i] = cluster-id of the cluster to which data[i] was assigned.
See Also:
cluster(Vector[], int), clusterDendrogram(Vector[])

cluster

public int[] cluster(Vector[] data)
This method is used for clustering via the TIRA Framework. The number of clusters to generate is taken from the constructor KNNHAC(Configuration) or from setNumberOfClusters(int). If you are not using TIRA, you can use clusterDendrogram(Vector[]) to get a complete dendrogram of the clustering process.

Specified by:
cluster in interface Clusterer
Specified by:
cluster in class AClusterer
Parameters:
data - The vectors to cluster.
Returns:
Cluster assignment c such that
c[i] = cluster-id of the cluster to which data[i] was assigned.
See Also:
KNNHAC(Configuration), setNumberOfClusters(int), clusterDendrogram(Vector[])

clusterDendrogram

public Dendrogram<DoubleMerge> clusterDendrogram(Vector[] data)
Cluster given data hierarchically.

Parameters:
data - The data to be clustered.
Returns:
A Dendrogram of the clustering process.

main

public static void main(java.lang.String[] args)
Parameters:
args -