Tuesday 5 October 2010

K-Means Clustering

K-Means is one of the most popular, classic and simple approaches to clustering. Clustering is a method of unsupervised learning, and a common technique for statistical data analysis used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics [3].





The code presented here is also part of the Accord.NET Framework. The Accord.NET Framework is a framework for developing machine learning, computer vision, computer audition, statistics and math applications. It is based on the already excellent AForge.NET Framework. Please see the starting guide for mode details. The latest version of the framework includes the latest version of this code plus many other statistics and machine learning tools.

Contents



  1. Introduction

    1. Lloyd's K-Mean algorithm



  2. Source code

  3. Using the code

  4. Sample application

  5. Conclusion

  6. Acknowledgements

  7. References

  8. See also



Introduction


In statistics and machine learning, k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean [4]. In its most common form, the algorithm is an iterative greedy algorithm which converges to a local optimum after a certain number of iterations.



kmeans

Illustration of the K-Means algorithm.

The algorithm works by first selecting k locations at random to be the initial centroids for the clusters. Each observation is then assigned to the cluster which has the nearest centroid, and the centroids are recalculated using the mean value of assigned values. The algorithm then repeats this process until the cluster centroids do not change anymore, or until the change less than a given threshold.



There are other refinements and extensions of the algorithm. The version depicted above is its most common form, also referred as Lloyd's algorithm.



Lloyd's K-Means algorithm



  1. Place K points into the space represented by the objects that are being clustered. These points represent initial group centroids.

  2. Assign each object to the group that has the closest centroid.

  3. When all objects have been assigned, recalculate the positions of the K centroids.

  4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.


 



Source code


The source code has been implemented using Accord.NET and is now part of the framework. In the current version (2.1.2), the following classes related to K-Means are contained inside the Accord.MachineLearning namespace. The source code available in this page contains only the parts of the framework actually needed to support the algorithm.



Class diagram for K-MeansClass diagram for the K-Means algorithm.



The KMeans class is the main class representing the K-Means algorithm. The algorithm itself is implemented in the Compute(double[][] data, double threshold) method, which accepts a set of observations and a convergence threshold to determine when the method should stop.








/// <summary>
/// Divides the input data into K clusters.
/// </summary>
public int[] Compute(double[][] data, double threshold)
{
int k = this.K;
int rows = data.Length;
int cols = data[0].Length;


// pick K unique random indexes in the range 0..n-1
int[] idx = Accord.Statistics.Tools.Random(rows, k);

// assign centroids from data set
this.centroids = data.Submatrix(idx);


// initial variables
int[] count = new int[k];
int[] labels = new int[rows];
double[][] newCentroids;


do // main loop
{

// Reset the centroids and the
// cluster member counters'
newCentroids = new double[k][];
for (int i = 0; i < k; i++)
{
newCentroids[i] = new double[cols];
count[i] = 0;
}


// First we will accumulate the data points
// into their nearest clusters, storing this
// information into the newClusters variable.

// For each point in the data set,
for (int i = 0; i < data.Length; i++)
{
// Get the point
double[] point = data[i];

// Compute the nearest cluster centroid
int c = labels[i] = Nearest(data[i]);

// Increase the cluster's sample counter
count[c]++;

// Accumulate in the corresponding centroid
double[] centroid = newCentroids[c];
for (int j = 0; j < centroid.Length; j++)
centroid[j] += point[j];
}

// Next we will compute each cluster's new centroid
// by dividing the accumulated sums by the number of
// samples in each cluster, thus averaging its members.
for (int i = 0; i < k; i++)
{
double[] mean = newCentroids[i];
double clusterCount = count[i];

for (int j = 0; j < cols; j++)
mean[j] /= clusterCount;
}


// The algorithm stops when there is no further change in
// the centroids (difference is less than the threshold).
if (centroids.IsEqual(newCentroids, threshold))
break;

// go to next generation
centroids = newCentroids;

}
while (true);


// Compute cluster information (optional)
for (int i = 0; i < k; i++)
{
// Extract the data for the current cluster
double[][] sub = data.Submatrix(labels.Find(x => x == i));

// Compute the current cluster variance
covariances[i] = Statistics.Tools.Covariance(sub, centroids[i]);

// Compute the proportion of samples in the cluster
proportions[i] = (double)sub.Length / data.Length;
}


// Return the classification result
return labels;
}


 



The implementation is quite straightforward and does not use additional techniques to avoid convergence problems. More refined techniques may be added to the implementation in the future, so please make sure to download the latest version of Accord.NET Framework for the most up-to-date revision of the code.



Using the code



To use the code, create a new instance of the KMeans class passing the desired number of clusters to its constructor. Additionally, you may also pass a distance function to be used as a distance metric during clustering. The default is to use the square Euclidean distance.





    // Declare some observations
double[][] observations =
{
new double[] { -5, -2, -1 },
new double[] { -5, -5, -6 },
new double[] { 2, 1, 1 },
new double[] { 1, 1, 2 },
new double[] { 1, 2, 2 },
new double[] { 3, 1, 2 },
new double[] { 11, 5, 4 },
new double[] { 15, 5, 6 },
new double[] { 10, 5, 6 },
};

// Create a new K-Means algorithm with 3 clusters
KMeans kmeans = new KMeans(3);

// Compute the algorithm, retrieving an integer array
// containing the labels for each of the observations
int[] labels = kmeans.Compute(observations);

// As a result, the first two observations should belong to the
// same cluster (thus having the same label). The same should
// happen to the next four observations and to the last three.


 


Sample application


The k-means clustering algorithm is commonly used in computer vision as a form of image segmentation. The results of the segmentation are often used to aid border detection and object recognition. The sample application performs image segmentation using the standard squared Euclidean distance over RGB pixel color space. There are, however, better distance metrics to be used for image segmentation, such as weighted distances and other color spaces, which will not be addressed in this example.



kmeans1



Original image (from Ossi Petruska Flickr page*).



To perform image segmentation, we will first translate our image into an array of pixel values. The single image will be read, pixel by pixel, into a jagged array where each element is a double array of length 3. Each element of those double array will contain one of the three RGB values scaled to the interval [–1,1].



After we perform clustering on this array of pixel values, each pixel will have an associated cluster label. Each of these values will then be swapped by its corresponding cluster centroid. The source code below is called when one clicks the Run button in the application.




private void btnRun_Click(object sender, EventArgs e)
{
// Retrieve the number of clusters
int k = (int)numClusters.Value;

// Load original image
Bitmap image = Properties.Resources.leaf;

// Transform the image into an array of pixel values
double[][] pixels = image.ToDoubleArray();


// Create a K-Means algorithm using given k and a
// square euclidean distance as distance metric.
KMeans kmeans = new KMeans(k, Distance.SquareEuclidean);

// Compute the K-Means algorithm until the difference in
// cluster centroids between two iterations is below 0.05
int[] idx = kmeans.Compute(pixels, 0.05);


// Replace every pixel with its corresponding centroid
pixels.ApplyInPlace((x, i) => kmeans.Clusters.Centroids[idx[i]]);

// Show resulting image in the picture box
pictureBox.Image = pixels.ToBitmap(image.Width, image.Height);
}




After segmentation, the following resulting images can be obtained:



kmeans5



Same image after K-Means clustering with k = 5.



kmeans10



Image after K-Means clustering with k = 10.



 



* The sample image used above has been licensed by Ossi Petruska in a Creative Commons Attribution-NonCommercial-ShareAlike 2.0 Generic license.



Conclusion


K-Means is a very simple and popular approach to clustering. The implementation presented here is the same implementation used in Accord.NET. As it can be seem, the method can be easily extended with custom distance functions through delegates or lambda expressions, and can be used in different contexts, such as image segmentation, without further modifications. As a suggestion for improvement, the method can be further speeded up by using the triangle inequality as suggested on the paper "Using the triangle inequality to accelerate k-means", by Charles Elkan.



In the next article, we will see how we can use K-Means to initialize the Expectation-Maximization algorithm for estimating Gaussian Mixture densities in Gaussian Mixture Models. Those articles will form the basis for Continuous density Hidden Markov Models.



Acknowledgements


To Antonino Porcino, who provided the first version of the code and for the valuable information about many other methods and algorithms.



References




See also


1 comment: