de.aitools.dm.clustering.algorithms
Class KMeans

java.lang.Object
  extended by de.aitools.dm.clustering.algorithms.ASoftClusterer
      extended by de.aitools.dm.clustering.algorithms.AClusterer
          extended by de.aitools.dm.clustering.algorithms.KMeans
All Implemented Interfaces:
Clusterer, SoftClusterer

public class KMeans
extends AClusterer

This class clusters Vectors using the KMeans algorithm.

This algorithm places randomly k centroids corresponding to a cluster. The centroids are the averages of the clusters. There are several steps that are performed.
In the assignment step the proximity of all vectors with each centroid is computed and the vector is assigned to the cluster which centroid had the highest proximity.
In the update step the centroids are recomputed.
Both steps are repeated until no vector changed cluster in the last iteration.
If the algorithm was set to be incrementally updating, the centroids are recomputed after a vector changed cluster.
Parameters:


This implementation uses the the algorithm published in kmeans++: The Advantages of Careful Seeding for choosing the start centroids. The first one is chosen randomly from all vectors/points. All other centroids are chosen iteratively, with each vector having a probability proportional to the squared distance to the nearest centroid. In this implementation distance of two points is measured as the difference between the similiarity of two points and the self-proximity. This should work for all implementations of Proximity.
Randomization is seeded using the default random seed for all cluster algorithms if not said otherwise.
For more information see:

Tan Pang-Ning, Steinbach Michael, Kumar Vipin (2006).
Introduction to Data Mining.
Pearson Education, Boston, MA.

BibTeX:
 @book{TanSteinbachKumar2006,
   address = {Boston, MA},
   author = {Tan Pang-Ning and Steinbach Michael and Kumar Vipin},
   publisher = {Pearson Education},
   title = {Introduction to Data Mining},
   year = {2006}
 }
 
And for the KMeans++ initialization:

Arthur David and Vassilvitskii Sergei (2007).
kmeans++: The Advantages of Careful Seeding

BibTeX:
 @article{ArthurVassilvitskii2007,
   author = {Arthur David and Vassilvitskii Sergei},
   title = {kmeans++: The Advantages of Careful Seeding},
   year = {2007}
 }
 

Version:
$Id: KMeans.java,v 1.24 2011/11/09 13:40:46 hoppe Exp $
Author:
johannes.kiesel(/\t)uni-weimar.de

Field Summary
 
Fields inherited from interface de.aitools.dm.clustering.Clusterer
DEFAULT_SEED
 
Constructor Summary
KMeans()
          KMeans using default settings.
KMeans(Configuration c)
          Constructor with a specified Configuration.
KMeans(int k, Proximity<Vector> proximityMeasure, boolean updateIncremental)
          The constructor to use in most cases.
The default seed for randomization is used.
KMeans(int k, Proximity<Vector> proximityMeasure, boolean updateIncremental, long seed)
          If one does not want to try out another seed value, use the constructor with the default seed value (without the randomSeed parameter) instead.
 
Method Summary
 int[] cluster(Vector[] data)
           
 Vector[] getCentroids()
          Returs the centroids of the clusters that were found during the last clustering process.
 boolean getIfUpdateIncrementally()
          Returns true if the update is done incrementally.
 int getNumberOfClusters()
          Gets the number of clusters.
 Proximity<Vector> getProximity()
          Get the proximity.
 long getRandomSeed()
          Gets the seed.
static void main(java.lang.String[] args)
          Runs the main.
 void setIncrementalUpdating(boolean updateIncremental)
          Sets the updating incrementally.
 void setNumClusters(int k)
          Sets the number of clusters.
 void setProximityMeasure(Proximity<Vector> proximityMeasure)
          Sets a proximity measure.
 void setRandomSeed(long randomSeed)
          Sets the random seed for the random number generator.
 
Methods inherited from class de.aitools.dm.clustering.algorithms.AClusterer
cluster, cluster, cluster, clusterSoft
 
Methods inherited from class de.aitools.dm.clustering.algorithms.ASoftClusterer
clusterSoft, clusterSoft, clusterSoft, getBiggestRange
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

KMeans

public KMeans(Configuration c)
Constructor with a specified Configuration.

Parameters:
c - Configuration

KMeans

public KMeans()
KMeans using default settings. This should only be used for testing or a quick try.


KMeans

public KMeans(int k,
              Proximity<Vector> proximityMeasure,
              boolean updateIncremental)
The constructor to use in most cases.
The default seed for randomization is used.

Parameters:
k - Number of clusters to find
proximityMeasure - The Proximity to use
updateIncremental - If to use incremental updating

KMeans

public KMeans(int k,
              Proximity<Vector> proximityMeasure,
              boolean updateIncremental,
              long seed)
If one does not want to try out another seed value, use the constructor with the default seed value (without the randomSeed parameter) instead.

Parameters:
k - Number of clusters to find
proximityMeasure - The Proximity to use
updateIncremental - If to use incremental updating
seed - The seed to be used for randomization
Method Detail

setNumClusters

public void setNumClusters(int k)
Sets the number of clusters.

Parameters:
k - the number of clusters to be generated

setRandomSeed

public void setRandomSeed(long randomSeed)
Sets the random seed for the random number generator.

Parameters:
randomSeed - Seed to be used for initializing the random number generator. This is used in each call to cluster(Vector[]). It is used for initialization of the start centroids and for randomizing the order at which data vectors will be assigned to clusters if using incremental updating (see setIncrementalUpdating(boolean))

setProximityMeasure

public void setProximityMeasure(Proximity<Vector> proximityMeasure)
Sets a proximity measure.

Parameters:
proximityMeasure - The Proximity to use

setIncrementalUpdating

public void setIncrementalUpdating(boolean updateIncremental)
Sets the updating incrementally.

Parameters:
updateIncremental - if to update incrementally

getNumberOfClusters

public int getNumberOfClusters()
Gets the number of clusters.

Returns:
The number of clusters to be generated

getRandomSeed

public long getRandomSeed()
Gets the seed.

Returns:
Seed to be used for initializing the random number generator. This is used in each call to cluster(Vector[]). It is used for initialization of the start centroids and for randomizing the order at which data vectors will be assigned to clusters if using incremental updating (see setIncrementalUpdating(boolean))

getProximity

public Proximity<Vector> getProximity()
Get the proximity.

Returns:
The proximity to use.

getIfUpdateIncrementally

public boolean getIfUpdateIncrementally()
Returns true if the update is done incrementally.

Returns:
true if the update is done incrementally, otherwise false

cluster

public int[] cluster(Vector[] data)
Specified by:
cluster in interface Clusterer
Specified by:
cluster in class AClusterer

getCentroids

public Vector[] getCentroids()
Returs the centroids of the clusters that were found during the last clustering process.

Returns:
An array of centroids represented by vectors:
Vector[] data = input data;
int[] assignment = kmeans.cluster(data);
Vector[] centroids = kmeans.getCentroids();
Vector centroidOfDataI = centroids[assignment[i]];

centroidOfDataI will now point to the centroid data[i] was assigned to at the end of the clustering process.

main

public static void main(java.lang.String[] args)
Runs the main.

Parameters:
args -