de.aitools.dm.clustering.algorithms Class KNNHAC

```java.lang.Object
de.aitools.dm.clustering.algorithms.ASoftClusterer
de.aitools.dm.clustering.algorithms.AClusterer
de.aitools.dm.clustering.algorithms.KNNHAC
```
All Implemented Interfaces:
Clusterer, SoftClusterer

`public final class KNNHAC extends AClusterer`

Class for clustering `Vector`s using the Hierarchical Agglomerative Clustering algorithm (EfficientHAC) described in Introduction to Information Retrieval. In Agglomerative Hierarchical Clustering, every data point starts out as a cluster of its own. In each iteration the two clusters whose `Proximity` is highest are merged, until only the desired number of clusters is left. There are, however, different methods of defining the proximity between two clusters. Implemented at the moment are:

• Single Link: The proximity of two clusters is the proximity of the most proximate points in the two clusters.
• Complete Link: The proximity of two clusters is the proximity of the points that are least proximate in the two clusters.
• Unweighted Average Link: The proximity of two clusters is the average proximity of the two clusters.
• Weighted Average Link: The proximity of two clusters is the average pairwise proximity of all pairs of points from the different clusters.
• Centroid: The proximity of two clusters is the proximity of the cluster centroids.
• Median: Like the Centroid method, except that the centroid after a merge is computed as if the two clusters had the same number of data points.
• Ward's Method: The proximity of two clusters is the increase of the SSE (sum of squared errors) that would result from merging the two clusters (based on their centroids). Thus the merge that results in the smallest increase of the SSE is always taken.
Note that Centroid, Median, and Ward's Method need a `Distance` measure as `Proximity`. For this implementation, all measures have to be symmetric.
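The merge loop described above can be sketched as a toy, self-contained single-link variant on a precomputed distance matrix. This is illustration only, not the actual `KNNHAC` implementation (which works on a k-nearest-neighbor graph with Lance-Williams updates):

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of the agglomerative merge loop: every point starts as its own
// cluster, and the two most proximate clusters (here: smallest single-link
// distance) are merged until k clusters remain.
public class ToyHAC {

    public static List<List<Integer>> cluster(double[][] dist, int k) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < dist.length; i++) {
            List<Integer> c = new ArrayList<>();
            c.add(i);                       // each point is a singleton cluster
            clusters.add(c);
        }
        while (clusters.size() > k) {
            int bestA = -1, bestB = -1;
            double best = Double.POSITIVE_INFINITY;
            // find the pair of clusters with the smallest single-link distance
            for (int a = 0; a < clusters.size(); a++) {
                for (int b = a + 1; b < clusters.size(); b++) {
                    double d = singleLink(dist, clusters.get(a), clusters.get(b));
                    if (d < best) { best = d; bestA = a; bestB = b; }
                }
            }
            clusters.get(bestA).addAll(clusters.remove(bestB)); // merge them
        }
        return clusters;
    }

    // single link: distance of the most proximate pair across the two clusters
    private static double singleLink(double[][] dist, List<Integer> a, List<Integer> b) {
        double min = Double.POSITIVE_INFINITY;
        for (int i : a) for (int j : b) min = Math.min(min, dist[i][j]);
        return min;
    }

    public static void main(String[] args) {
        // four points on a line at 0, 1, 10, 11
        double[][] dist = {
            {0, 1, 10, 11},
            {1, 0, 9, 10},
            {10, 9, 0, 1},
            {11, 10, 1, 0}
        };
        System.out.println(cluster(dist, 2)); // [[0, 1], [2, 3]]
    }
}
```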
The implementation uses Lance-Williams coefficients for the different clustering methods, as in Multivariate Analysemethoden.
With these coefficients, the proximity between any cluster Q and a newly merged cluster R (merged from clusters A and B) can be computed by a linear function taking as input the proximities between Q, A, and B as well as the number of data points in those clusters. Note that there are other methods of Agglomerative Hierarchical Clustering that cannot be expressed in this way, but all of the methods implemented (and described above) can.
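Written out (a reconstruction from the general literature; the concrete coefficients depend on the chosen method), the Lance-Williams recurrence for the proximity between a cluster Q and the cluster R merged from A and B is:

```latex
d(Q, R) = \alpha_A \, d(Q, A) + \alpha_B \, d(Q, B)
        + \beta \, d(A, B) + \gamma \, \lvert d(Q, A) - d(Q, B) \rvert
```

For single link, for example, \(\alpha_A = \alpha_B = \tfrac{1}{2}\), \(\beta = 0\), and \(\gamma = -\tfrac{1}{2}\), which reduces the right-hand side to \(\min(d(Q, A), d(Q, B))\).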
This algorithm runs in O((n - k) n log n), with n being the number of data points and k being the number of desired clusters.
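As a self-contained illustration (not the library's code; the coefficient values are the standard ones from the literature, assumed rather than taken from this class), plugging the single-link and complete-link coefficients into the Lance-Williams update reproduces the minimum and maximum of the two proximities:

```java
// Hypothetical illustration of the Lance-Williams update. Single link uses
// alpha_A = alpha_B = 1/2, beta = 0, gamma = -1/2; complete link uses
// gamma = +1/2 instead.
public class LanceWilliamsDemo {

    // d(Q, R) for single link: reduces to min(d(Q,A), d(Q,B))
    public static double singleLink(double dQA, double dQB, double dAB) {
        return 0.5 * dQA + 0.5 * dQB + 0.0 * dAB - 0.5 * Math.abs(dQA - dQB);
    }

    // d(Q, R) for complete link: reduces to max(d(Q,A), d(Q,B))
    public static double completeLink(double dQA, double dQB, double dAB) {
        return 0.5 * dQA + 0.5 * dQB + 0.0 * dAB + 0.5 * Math.abs(dQA - dQB);
    }

    public static void main(String[] args) {
        System.out.println(singleLink(2.0, 5.0, 3.0));   // 2.0 (the minimum)
        System.out.println(completeLink(2.0, 5.0, 3.0)); // 5.0 (the maximum)
    }
}
```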
References:
Backhaus Klaus, Erichson Bernd, Plinke Wulff, Weiber Rolf (2006).
Multivariate Analysemethoden.
Springer, Berlin, Heidelberg, New York.

BibTeX:
``` @book{BackhausErichsonPlinkeWeiber2006,
author = {Backhaus, Klaus and Erichson, Bernd and Plinke, Wulff and Weiber, Rolf},
publisher = {Springer},
title = {Multivariate Analysemethoden},
year = {2006}
}
```

Manning Christopher D., Raghavan Prabhakar, Schütze Hinrich (2008).
Introduction to Information Retrieval.
Cambridge University Press, New York, NY.

BibTeX:
``` @book{ManningRaghavanSchuetze2008,
author = {Manning, Christopher D. and Raghavan, Prabhakar and Schütze, Hinrich},
publisher = {Cambridge University Press},
title = {Introduction to Information Retrieval},
year = {2008}
}
```

Tan Pang-Ning, Steinbach Michael, Kumar Vipin (2006).
Introduction to Data Mining.
Pearson Education, Boston, MA.

BibTeX:
``` @book{TanSteinbachKumar2006,
author = {Tan, Pang-Ning and Steinbach, Michael and Kumar, Vipin},
publisher = {Pearson Education},
title = {Introduction to Data Mining},
year = {2006}
}
```

Lance G. N., Williams W. T. (1967).
A general theory of classificatory sorting strategies. 1. Hierarchical systems.
Computer Journal, Vol. 9, p. 373.

BibTeX:
``` @article{LanceWilliams1967,
author = {Lance, G. N. and Williams, W. T.},
journal = {Computer Journal},
title = {A general theory of classificatory sorting strategies. 1. Hierarchical systems},
volume = {9},
pages = {373},
year = {1967}
}
```

Version:
\$Id: KNNHAC.java,v 1.6 2011/11/09 13:40:46 hoppe Exp \$
Author:
johannes.kiesel(/\t)uni-weimar.de

Field Summary

Fields inherited from interface de.aitools.dm.clustering.Clusterer
`DEFAULT_SEED`

Constructor Summary
`KNNHAC(Configuration configuration)`
Create a new K-Nearest-Neighbor-Hierarchical-Agglomerative-Clusterer (`KNNHAC`).
```KNNHAC(HACClusterMethod clusterMethod, Proximity<Vector> proximityMeasure)```
Create a new K-Nearest-Neighbor-Hierarchical-Agglomerative-Clusterer (`KNNHAC`) using the default value for the number of neighbors (see `setNumberOfNeighbors(int)`).
```KNNHAC(HACClusterMethod clusterMethod, Proximity<Vector> proximityMeasure, int numNeighbors)```
Create a new K-Nearest-Neighbor-Hierarchical-Agglomerative-Clusterer (`KNNHAC`).

Method Summary
` int[]` `cluster(Vector[] data)`
This method is used for clustering via the TIRA Framework.
` int[]` ```cluster(Vector[] data, double threshold)```
Cluster given data hierarchically until the proximities between all clusters are less than or equal to threshold.
` int[]` ```cluster(Vector[] data, int numClusters)```
Cluster given data hierarchically until only numClusters are left.
` Dendrogram<DoubleMerge>` `clusterDendrogram(Vector[] data)`
Cluster given data hierarchically.
` HACClusterMethod` `getClusterMethod()`

` int` `getNumberOfNeighbors()`

` Proximity<Vector>` `getProximityMeasure()`

`static void` `main(java.lang.String[] args)`

` void` `setClusterMethod(HACClusterMethod clusterMethod)`
Sets the concrete HAC cluster method used by this algorithm.
` void` `setNumberOfClusters(int numClusters)`
Made for integration into the TIRA Framework.
` void` `setNumberOfNeighbors(int numNeighbors)`
This sets the number of neighbors for the K-Nearest-Neighbor-Graph that is used in this algorithm.
` void` `setProximityMeasure(Proximity<Vector> proximityMeasure)`
Sets the proximity measure to be used for the clustering steps.

Methods inherited from class de.aitools.dm.clustering.algorithms.AClusterer
`cluster, cluster, cluster, clusterSoft`

Methods inherited from class de.aitools.dm.clustering.algorithms.ASoftClusterer
`clusterSoft, clusterSoft, clusterSoft, getBiggestRange`

Methods inherited from class java.lang.Object
`equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Constructor Detail

KNNHAC

```public KNNHAC(HACClusterMethod clusterMethod,
Proximity<Vector> proximityMeasure)```
Create a new K-Nearest-Neighbor-Hierarchical-Agglomerative-Clusterer (`KNNHAC`) using the default value for the number of neighbors (see `setNumberOfNeighbors(int)`).

Parameters:
`clusterMethod` - As in `setClusterMethod(HACClusterMethod)`.
`proximityMeasure` - As in `setProximityMeasure(Proximity)`.
See Also:
`KNNHAC(HACClusterMethod, Proximity, int)`, `KNNHAC(Configuration)`

KNNHAC

```public KNNHAC(HACClusterMethod clusterMethod,
Proximity<Vector> proximityMeasure,
int numNeighbors)```
Create a new K-Nearest-Neighbor-Hierarchical-Agglomerative-Clusterer (`KNNHAC`).

Parameters:
`clusterMethod` - As in `setClusterMethod(HACClusterMethod)`.
`proximityMeasure` - As in `setProximityMeasure(Proximity)`.
`numNeighbors` - As in `setNumberOfNeighbors(int)`.
See Also:
`KNNHAC(HACClusterMethod, Proximity)`, `KNNHAC(Configuration)`

KNNHAC

`public KNNHAC(Configuration configuration)`
Create a new K-Nearest-Neighbor-Hierarchical-Agglomerative-Clusterer (`KNNHAC`).

Parameters:
`configuration` - Object for configuring this clusterer:
See Also:
`KNNHAC(HACClusterMethod, Proximity)`, `KNNHAC(HACClusterMethod, Proximity, int)`
Method Detail

setClusterMethod

`public void setClusterMethod(HACClusterMethod clusterMethod)`
This is a general implementation of a hierarchical agglomerative clustering algorithm (HAC). The concrete method it uses can be changed using this method.

Parameters:
`clusterMethod` - The cluster method to use.

setNumberOfNeighbors

`public void setNumberOfNeighbors(int numNeighbors)`
This sets the number of neighbors for the k-nearest-neighbor graph that is used in this algorithm. The general HAC algorithm needs O(n²) space; this setting can be used to lower that requirement.
See ```KNNGraph.createUndirectedKNNIntGraph( Vector[], Proximity, int, double)``` for an explanation.
• By default (numNeighbors = 0), the square root of the number of input points is used.
• If numNeighbors > 0, this value is used as the number of neighbors. However, the number of neighbors will never exceed the number of points - 1.
• If numNeighbors < 0, its absolute value is subtracted from the number of points, so a value of -1 results in creating the complete graph (without self-similarity edges). The resulting number of neighbors will never be less than 1.
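Under one reading of these rules, the effective neighbor count could be computed as follows (a hypothetical helper; `effectiveNeighbors` is an invented name and the library may compute this differently):

```java
// Hypothetical helper mirroring the numNeighbors rules described above.
public class NeighborCount {

    public static int effectiveNeighbors(int numNeighbors, int numPoints) {
        int k;
        if (numNeighbors == 0) {
            k = (int) Math.sqrt(numPoints);   // default: square root of n
        } else if (numNeighbors > 0) {
            k = numNeighbors;                 // explicit neighbor count
        } else {
            k = numPoints + numNeighbors;     // n - |numNeighbors|; -1 => complete graph
        }
        k = Math.min(k, numPoints - 1);       // never more than n - 1
        return Math.max(k, 1);                // never less than 1
    }

    public static void main(String[] args) {
        System.out.println(effectiveNeighbors(0, 100));  // 10 (sqrt of 100)
        System.out.println(effectiveNeighbors(-1, 50));  // 49 (complete graph)
        System.out.println(effectiveNeighbors(200, 50)); // 49 (capped at n - 1)
    }
}
```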

Parameters:
`numNeighbors` - Number of neighbors for the graph as shown above.

setProximityMeasure

`public void setProximityMeasure(Proximity<Vector> proximityMeasure)`
Sets the proximity measure to be used for the clustering steps.

Parameters:
`proximityMeasure` - The measure to use.

setNumberOfClusters

`public void setNumberOfClusters(int numClusters)`
Made for integration into the TIRA Framework. As it uses `cluster(Vector[])`, through which the number of clusters cannot be specified, this method (or `KNNHAC(Configuration)`) can be used to tell the algorithm how many clusters to generate.

Parameters:
`numClusters` - Number of clusters to generate. Must be greater than zero.

getClusterMethod

`public HACClusterMethod getClusterMethod()`
Returns:
The cluster method to use.
See Also:
`setClusterMethod(HACClusterMethod)`

getProximityMeasure

`public Proximity<Vector> getProximityMeasure()`
Returns:
The proximity measure to use.
See Also:
`setProximityMeasure(Proximity)`

getNumberOfNeighbors

`public int getNumberOfNeighbors()`
Returns:
Number of neighbors in the k-nearest-neighbor-graph that will be created during clustering. See `setNumberOfNeighbors(int)` for more information.

cluster

```public int[] cluster(Vector[] data,
int numClusters)```
Cluster given data hierarchically until only numClusters are left.

Parameters:
`data` - The vectors to cluster.
`numClusters` - Number of clusters to generate.
Returns:
Cluster assignment c such that
c[i] = cluster-id of the cluster to which data[i] was assigned.
See Also:
`cluster(Vector[], double)`, `clusterDendrogram(Vector[])`
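The returned flat assignment array can be turned into explicit clusters with a few lines of standard Java (a usage sketch; `group` is an invented helper, not part of the library):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Turn the flat assignment array c (c[i] = cluster-id of data[i]) into an
// explicit list of member indices per cluster id.
public class Assignments {

    public static Map<Integer, List<Integer>> group(int[] c) {
        Map<Integer, List<Integer>> clusters = new TreeMap<>();
        for (int i = 0; i < c.length; i++) {
            clusters.computeIfAbsent(c[i], id -> new ArrayList<>()).add(i);
        }
        return clusters;
    }

    public static void main(String[] args) {
        int[] c = {0, 1, 0, 2, 1};
        System.out.println(group(c)); // {0=[0, 2], 1=[1, 4], 2=[3]}
    }
}
```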

cluster

```public int[] cluster(Vector[] data,
double threshold)```
Cluster given data hierarchically until the proximities between all clusters are less than or equal to threshold.

Parameters:
`data` - The vectors to cluster.
`threshold` - The threshold for clustering.
Returns:
Cluster assignment c such that
c[i] = cluster-id of the cluster to which data[i] was assigned.
See Also:
`cluster(Vector[], int)`, `clusterDendrogram(Vector[])`

cluster

`public int[] cluster(Vector[] data)`
This method is used for clustering via the TIRA Framework. The number of clusters to generate is taken from the constructor `KNNHAC(Configuration)` or from `setNumberOfClusters(int)`. If you are not using TIRA, you can use `clusterDendrogram(Vector[])` to get a complete dendrogram of the clustering process.

Specified by:
`cluster` in interface `Clusterer`
Specified by:
`cluster` in class `AClusterer`
Parameters:
`data` - The vectors to cluster.
Returns:
Cluster assignment c such that
c[i] = cluster-id of the cluster to which data[i] was assigned.
See Also:
`KNNHAC(Configuration)`, `setNumberOfClusters(int)`, `clusterDendrogram(Vector[])`

clusterDendrogram

`public Dendrogram<DoubleMerge> clusterDendrogram(Vector[] data)`
Cluster given data hierarchically.

Parameters:
`data` - The data to be clustered.
Returns:
A `Dendrogram` of the clustering process.

main

`public static void main(java.lang.String[] args)`
Parameters:
`args` -