Assessment Metrics for Clustering Algorithms Assessment Metrics for Clustering Algorithms
Assessing the quality of your model is one of the most important considerations when deploying any machine learning algorithm. For supervised learning problems, this... Assessment Metrics for Clustering Algorithms

Assessing the quality of your model is one of the most important considerations when deploying any machine learning algorithm.

For supervised learning problems, this is easy. There are already labels for every example, so the practitioner can test the model’s performance on a reserved evaluation set.

We don’t have that luxury when we’re dealing with unlabeled data in unsupervised learning contexts. There’s nothing to even test since there’s no ground truth — the idea of testing in this arena is a flawed premise.

That doesn’t mean assessing the model is a lost cause. Numerous metrics examine the quality of clustering results when labeled data is unavailable. These metrics can give the practitioner insight into how the clusters might change depending on the algorithm’s selection and the natural tendency of the data to group together.

A Word of Warning

Before you throw these metrics around, understand that they do not measure the validity of the model’s predictions. Remember, we don’t have any reasonable way to determine how valid the cluster predictions are.

Instead, the metrics evaluate the comparative performance of models against each other in terms of some heuristic metric.

To illustrate the point, think about what would happen if you were asked to group together 100 different bands. You might come up with some funky clustering where the Beatles, Black Sabbath, and Sex Pistols are all in the same cluster. Someone else comes up with a different clustering that places the Beatles with Simon and Garfunkel, Black Sabbath with Led Zeppelin, and Sex Pistols with Black Flag. Maybe there’s no ground truth, but we have some intuition that the second clustering is better because it places each band in a cluster of bands ‘closer’ to their own sound.

All that is to say that we’re not measuring the veracity of our clusters. We’re just interested in which algorithm can make clusters where the points are most similar, as measured by our metric.

To that point, take care not to apply these metrics to data with embedded structures that might render your metric useless. If your metric assumes convexity but the data is naturally non-convex, then the metric is useless for that algorithm.

That said, let’s look at some of the internal measures you can deploy on clustering algorithms to measure the relative quality of different models.

Davies-Bouldin Index

The DB Index is calculated by the following formula:

where n is the number of clusters and σi is the average distance of all points in cluster i from the cluster centroid ci.

The DB index captures the intuition that clusters that are (1) well-spaced from each other and (2) themselves very dense are likely a ‘good’ clustering. This is because the measure’s ‘max’ statement repeatedly selects the values where the average point is farthest away from its centroid, and where the centroids are closest together. As the DB index shrinks, the clustering is considered ‘better’.

Dunn Index

The formula for the Dunn Index is as follows:

where i, j and k are each indices for clusters, d measures the inter-cluster distance and d’ measures the intra-cluster difference.

The Dunn Index captures the same idea as the DB Index: it gets better when clusters are well-spaced and dense. But the Dunn Index increases as performance improves.

What differs is the way this problem is approached. While the DB index considers the dispersion and separation of all clusters, the Dunn Index only considers the worst cases in the clustering: the clusters that are closest together and the single most dispersed cluster. Depending on your application, the change in objective may introduce unexpected problems.

It’s really up to you which of the two approaches to work with.

Silhouette Coefficient

The Silhouette Coefficient is measured like so:

where a(i) is the average distance of point i from all other points in its cluster and b(i) is the smallest average distance of i to all points in any other cluster. To clarify, b(i) is found by measuring the average distance of i from every point in cluster A, the average distance of i from every point in cluster B, and taking the smallest resulting value.

The Silhouette Coefficient tells us how well-assigned each individual point is. If S(i) is close to 0, it is right at the inflection point between two clusters. If it is closer to -1, then we would have been better off assigning it to the other cluster. If S(i) is close to 1, then the point is well-assigned and can be interpreted as belonging to an ‘appropriate’ cluster.

The Silhouette Coefficient is a very intuitive and sophisticated measure of distance. Its downfall is that it can be extremely expensive to compute on all n points. This is because we must compute the distance of i from all other n – 1 points for each i, which leads to a complexity of O(n2).

Many practitioners will balk at that cautious assessment and shrug, saying it’s less than NP. However, for very large datasets these time complexities can become unmanageable.

These are just three of the most popular methods for assessing cluster quality. There are a variety of other techniques, but these will get you a long way on their own. They’ll certainly give you more insight about your model’s accuracy than blind trust.


Ready to learn more data science skills and techniques in-person? Register for ODSC West this October 31 – November 3 now and hear from world-renowned names in data science and artificial intelligence!

Spencer Norris

Spencer Norris

Spencer Norris is a data scientist and freelance journalist. He currently works as a contractor and publishes on his blog on Medium: https://medium.com/@spencernorris