Understanding the Mathematics behind K-Means Clustering

In this post, we’re going to dive deep into one of the most influential unsupervised learning algorithms—k-means clustering. K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms, and we’ll be discussing how the algorithm works, distance and accuracy metrics, and a lot more.

What is meant by unsupervised learning?

Unsupervised learning is a type of self-organized learning that aids us in discovering patterns in our data related to various features. It is one of the three main categories of machine learning, along with supervised and reinforcement learning.

Two of the main methods used in unsupervised learning are principal component anaylsis and cluster analysis. To learn more about principal component analysis, refer to this article.

What is Clustering?

Clustering is the process of dividing the data space or data points into a number of groups, such that data points in the same groups are more similar to other data points in the same group, and dissimilar to the data points in other groups.

Clustering Objectives

The major objective of clustering is to find patterns (i.e. similarities within data points) in an unlabeled dataset and cluster them together. But how do we decide what constitutes a good clustering? There isn’t a definitive best way of clustering, which would be independent of the final aim of the clustering. The end results usually depends on users and the parameters they select, focusing on the most important features used for clustering.

Applications of Clustering in Real-World problems

Vector quantization

K-means originates from signal processing, but it’s also used for vector quantization. For example, color quantization is the task of reducing the color palette of an image to a fixed number of colors k. The k-means algorithm can easily be used for this task.

Psychology and Medicine

An illness or condition frequently has a number of variations, and cluster analysis can be used to identify these different subcategories. For example, clustering has been used to identify different types of depression. Cluster analysis can also be used to detect patterns in the spatial or temporal distribution of a disease.

Recommender Systems

Clustering can also be used in recommendation engines. In the case of recommending movies to someone, you can look at the movies enjoyed by a user and then use clustering to find similar movies.

For a detailed discussion on recommender systems, refer to this series.

Document Clustering

This is another common application of clustering. Let’s say you have multiple documents and you need to cluster similar documents together. Clustering helps us group these documents such that similar documents are in the same clusters.

Image Segmentation

Image segmentation is a wide-spread application of clustering. Similar pixels in the image are grouped together. We can apply this technique to create clusters having similar pixels in the same group.

The k-means clustering algorithm

K-means clustering is a prototype-based, partitional clustering technique that attempts to ﬁnd a user-speciﬁed number of clusters (k), which are represented by their centroids.

Procedure

We ﬁrst choose k initial centroids, where k is a user-speciﬁed parameter; namely, the number of clusters desired. Each point is then assigned to the closest centroid, and each collection of points assigned to a centroid is called a cluster. The centroid of each cluster is then updated based on the points assigned to the cluster. We repeat the assignment and update steps until no point changes clusters, or similarly, until the centroids remain the same.

Proximity Measures

For clustering, we need to define a proximity measure for two data points. Proximity here means how similar/dissimilar the samples are with respect to each other.

Similarity measure is large if features are similar.
Dissimilarity measure is small if features are similar.

Data in Euclidean Space

Consider data whose proximity measure is Euclidean distance. For our objective function, which measures the quality of a clustering, we use the sum of the squared error (SSE), which is also known as scatter.

In other words, we calculate the error of each data point, i.e., its Euclidean distance to the closest centroid, and then compute the total sum of the squared errors. Given two diﬀerent sets of clusters that are produced by two diﬀerent runs of K-means, we prefer the one with the smallest squared error, since this means that the prototypes (centroids) of this clustering are a better representation of the points in their cluster.

Document Data

To illustrate that K-means is not restricted to data in Euclidean space, we consider document data and the cosine similarity measure:

Implementation in scikit-learn

It merely takes four lines to apply the algorithm in Python with sklearn: import the classifier, create an instance, fit the data on the training set, and predict outcomes for the test set:

Parameter tuning in scikit-learn

n_clusters-int, default=8. n_clusters defines the number of clusters to form, as well as the number of centroids to generate.
max_iter-int, default=300. Refers to the maximum number of iterations of the k-means clustering algorithm for a single run.
algorithm{“auto”, “full”, “elkan”}, default=”auto”. Refers to which K-means algorithm to use. The classical EM-style algorithm is “full”. The “elkan” variation is more efficient by using the triangle inequality, but it currently doesn’t support sparse data. “auto” chooses “elkan” for dense data and “full” for sparse data.

Time and Space Complexity

The space requirements for k-means clustering are modest, because only the data points and centroids are stored. Speciﬁcally, the storage required is O((m + K)n), where m is the number of points and n is the number of attributes. The time requirements for k-means are also modest — basically linear in terms of the number of data points. In particular, the time required is O(I∗K∗m∗n), where I is the number of iterations required for convergence.

Choosing Initial Centroids

When random initialization of centroids is used, diﬀerent runs of K-means typically produce diﬀerent total SSEs. Choosing the proper initial centroids is the key step of the basic K-means procedure. A common approach is to choose the initial centroids randomly, but the resulting clusters are often poor.

Another technique that’s commonly used to address the problem of choosing initial centroids is to perform multiple runs, each with a diﬀerent set of randomly-chosen initial centroids, and then select the set of clusters with the minimum SSE.

But often, random initialization leads to sub-optimal results, and may not work well in cases with clusters of different shapes and densities, or centroids located too far or too close to each other. This can result in overlapping clusters of different classes, or the distribution of clusters belonging to the same class.

Bisecting k-means: An Improvement

The bisecting k-means algorithm is a straightforward extension of the basic k-means algorithm that’s based on a simple idea: to obtain K clusters, split the set of all points into two clusters, select one of these clusters to split, and
so on, until k clusters have been produced. This helps in minimizing the SSE and results in an optimal clustering.

Choosing K

There can be various methods to determine the optimal value of k for convergence of the algorithm and to make clear distinction between clusters or different classes in a dataset.

Elbow Method

There’s a popular method known as elbow method, which is used to determine the optimal value of k to perform clustering. The basic idea behind this method is that it plots the various values of cost with changing k. The point where this distortion declines the most is the elbow point, which works as an optimal value of k.

Silhouette Method

In the silhouette method, we assume that the data has already been clustered into k clusters by k-means clustering. Then for each data point, we define the following:

C(i): The cluster assigned to the ith data point
|C(i)|: The number of data points in the cluster assigned to the ith data point
a(i): Gives a measure of how well assigned the ith data point is to it’s cluster

b(i): Defined as the average dissimilarity to the closest cluster which is not it’s cluster

The silhouette coefficient s(i) is given by:

We determine the average silhouette for each value of k, and the value of k that has the maximum value of s(i) is considered the optimal number of clusters for the unsupervised learning algorithm.

The Curse of Dimensionality

The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces (often with hundreds or thousands of dimensions) that do not occur in low-dimensional settings, such as the three-dimensional physical space of everyday experience.

The common theme of these problems is that when the dimensionality increases, the volume of the space increases so fast that the available data become sparse. This sparsity is problematic for any method that requires statistical significance.

In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with dimensionality. Also, organizing and searching data often relies on detecting areas where objects form groups with similar properties; in high-dimensional data, however, all objects appear to be sparse and dissimilar in many ways, which prevents common data organization strategies from being efficient.

In case of k-means clustering, the curse of dimensionality results in difficulty in clustering data due to vast data space. For example, with Euclidean space as a proximity measure, two data points that may be very dissimilar could be grouped together because, due to too many dimensions, somehow, their net distance from the centroid is comparable.

Advantages of k-means clustering

K-means clustering is relatively simple to implement, and can be implemented without using frameworks—just simple programming language, specifying one’s own proximity measures.
The algorithm is known to easily adapt to new examples.
It guarantees convergence by trying to minimize the total SSE as an objective function over a number of iterations.
The algorithm is fast and efficient in terms of computational cost, which is typically O(K*n*d).

Disadvantages of k-means clustering

Choosing k manually. This is the greatest factor in the convergence of the algorithm and can provide widely different results for different values of k.
Clustering data of varying sizes and density. K-means doesn’t perform well with clusters of different sizes, shapes, and density. To cluster such data, you need to generalize k-means.
Clustering outliers. Outliers must be removed before clustering, or they may affect the position of the centroid or make a new cluster of their own.
Being dependent on initial values. As the value of k increases, other algorithms(i.e. k-means seeding) need to be applied to give better values for the initial centroids.
Scaling with number of dimensions. As the number of dimensions increases, the difficulty in getting the algorithm to converge increases due to the curse of dimensionality, discussed above.
If there is overlapping between clusters, k-means doesn’t have an intrinsic measure for uncertainty; thus it’s difficult to identify which points in the overlapping region should be assigned to which class.

How to prepare your data for k-means clustering

The algorithm provides best results when the data points are well separated from each other; thus, we must ensure that all the data points are the most similar to their centroid and as different as possible from the other centroids. Various iterations are required for convergence, and we can also use methods like splitting clusters, choosing one centroid randomly, and placing the next centroid as far from the previously chosen one as possible. All of these techniques can help reduce the overall SSE.
Scale/standardize the data when applying k-means algorithm—because it is dependent of the distances of the data points from the centroid, if all the features are not scaled, some features may dominate the data space and lead to biased results.

Sources to get started with K-means clustering

Here are a few sources which will help you to implement k-means on your dataset:

Conclusion

In this post, we read about k-means clustering in detail and gained insights about the mathematics behind it. Despite being widely used and strongly supported, it has its share of advantages and disadvantages.

Let me know if you liked the article and how I can improve it. All feedback is welcome. Check out my other articles in the series: Understanding the mathematics behind Naive Bayes, Support Vector Machines and Principal Component Analysis.

I’ll be exploring the mathematics involved in other foundational machine learning algorithms in future posts, so stay tuned.