Evaluating Clustering in Machine Learning
Clustering has always been one of those topics that garnered my attention. Especially when I was first getting into the whole sphere of machine learning, unsupervised clustering always carried an allure with it for me.

To put it simply, clustering is rather like the unsung hero of machine learning. This form of unsupervised learning aims to bundle similar data points into groups.

Visualize yourself in a social gathering where everyone is a stranger.

How would you decipher the crowd?

Perhaps, by grouping individuals based on shared traits, such as those laughing at a joke, the football aficionados deep in conversation, or the group captivated by a literary discussion. That’s clustering in a nutshell!

You may wonder, “Why is it relevant?”.

Clustering boasts numerous applications.

  • Customer segmentation: helping businesses categorise their customers according to buying patterns so they can tailor their marketing approaches.
  • Anomaly detection: identifying peculiar data points, like suspicious transactions in banking.
  • Optimised resource utilisation: for example, when configuring computing clusters.

However, there’s a caveat.

How do we make sure that our clustering effort is successful?

How can we efficiently evaluate a clustering solution?

This is where the requirement for robust evaluation methods emerges.

Without a robust evaluation technique, we could potentially end up with a model that appears promising on paper, but drastically underperforms in practical scenarios.

In this article, we’ll examine two renowned clustering evaluation methods: the Silhouette score and Density-Based Clustering Validation (DBCV). We’ll dive into their strengths, limitations, and ideal scenarios of use.

The Importance of Clustering Evaluation

Having established an understanding of what clustering is, it’s now time to delve into why we need to evaluate clustering systems.

In machine learning, constructing a model constitutes only half the victory. The remaining half hinges on assessing its performance.

Throughout my professional career, I’ve unfortunately witnessed several instances (especially when it involved more junior employees) where clustering techniques were simply not evaluated. There’s a common misconception that we simply cannot evaluate the performance of unsupervised machine learning techniques.

For clustering, this assessment isn’t merely valuable — it’s absolutely indispensable.

Consider this analogy. Imagine yourself as a detective, faced with numerous clues. You sort these clues based on resemblances, aiming to crack the case. However, how do you validate whether your groupings are correct? Even with all the appropriate clues at hand, a misstep in grouping them could steer you off course. Herein lies the essence of clustering evaluation. It’s the methodology that helps us verify whether our ‘detective work’ is heading in the right direction.

Measuring our clustering results allows us to gauge the quality of our clusters.

It indicates whether similar data points have been aptly grouped and disparate ones suitably segregated.

This is incredibly important in many real-world scenarios. For example, in customer segmentation, erroneous clustering could result in ineffective marketing initiatives, adversely impacting a company’s financial health.

But what if we utilize unsuitable evaluation metrics?

Imagine you’re employing a metric that prefers spherical clusters when dealing with a dataset characterized by non-spherical clusters.

The consequence?

An inflated evaluation score that imparts a misleading sense of precision. It’s analogous to relying on a biased compass — it might convince you you’re venturing north when in reality, you’re veering south!

In the most unfavorable situations, these deceptive results could lead to flawed decision-making, squandered resources, and overlooked opportunities.

Common Methods for Clustering Evaluation

Numerous methods for cluster evaluation are available. Nevertheless, as previously mentioned, this article will cast the spotlight on two in particular: the Silhouette score and Density-Based Clustering Validation (DBCV).

Let’s kick things off with the Silhouette score, shall we?

The Silhouette Score

This metric has a bit of star status in the realm of clustering evaluation, largely due to its straightforwardness and interpretability.

  • The score quantifies how similar a data point is to its own cluster versus other clusters.
  • The score ranges from -1 to 1, with a high score suggesting that the data point is well aligned with its own cluster and poorly matched to adjacent clusters.
  • The score proves particularly useful when dealing with spherical or convex clusters. It’s frequently employed in customer segmentation scenarios where customers with akin purchasing habits are grouped.

Nonetheless, one of its weak spots is clusters of arbitrary shape.

The score tends to favor convex clusters, potentially yielding misleading results when applied to datasets with non-convex clusters. It’s akin to attempting to fit a square peg into a round hole — it simply doesn’t align!

In Python, we can calculate the Silhouette score as follows:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs

# Let's create a hypothetical dataset
X, _ = make_blobs(n_samples=500,
                  center_box=(-10.0, 10.0),
                  random_state=1)  # For reproducibility

# Specify the number of clusters and fit the data
kmeans = KMeans(n_clusters=4, random_state=1).fit(X)

# Retrieve the cluster labels assigned during fitting
labels = kmeans.labels_

# Compute the Silhouette score (mean over all samples)
silhouette_avg = silhouette_score(X, labels)

print("The average silhouette score is:", silhouette_avg)
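To make the score less of a black box, here is a minimal sketch that recomputes it by hand. For every point, a is the mean distance to the other points in its own cluster, b is the smallest mean distance to the points of any other cluster, and the point's score is (b - a) / max(a, b). The helper name manual_silhouette and the explicit n_init=10 are choices made for this illustration, not part of scikit-learn's API or the example above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances, silhouette_score

# Same kind of hypothetical dataset as in the example above
X, _ = make_blobs(n_samples=500, center_box=(-10.0, 10.0), random_state=1)
labels = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(X)


def manual_silhouette(X, labels):
    """Mean of s(i) = (b(i) - a(i)) / max(a(i), b(i)) over all points."""
    dist = pairwise_distances(X)  # Euclidean by default
    cluster_ids = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # convention: s(i) = 0 for a singleton cluster
        # a(i): mean distance to the *other* points of the same cluster
        a = dist[i, own].sum() / (n_own - 1)
        # b(i): smallest mean distance to the points of any other cluster
        b = min(dist[i, labels == c].mean()
                for c in cluster_ids if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()


manual_avg = manual_silhouette(X, labels)
sklearn_avg = silhouette_score(X, labels)
print("Manual silhouette: ", manual_avg)
print("sklearn silhouette:", sklearn_avg)
```

Because both a and b are plain mean distances, a compact, blob-like cluster is rewarded by construction, which is where the bias towards convex clusters comes from.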

Density-Based Clustering Validation (DBCV)

This metric may not be as renowned as the Silhouette score, but in the right circumstances, it’s remarkably effective.

DBCV is built on two quantities:

  • the density within a cluster,
  • the density between clusters.

The term ‘density’ refers to the proximity of data points to each other. A high intra-cluster density indicates that the data points in that cluster are closely organized (i.e., suggesting a well-formed cluster). A low inter-cluster density suggests that the clusters themselves are well-separated, which is also a positive attribute.

This method excels when handling clusters of arbitrary shape. It doesn’t carry the same bias as the Silhouette score, making it a more adaptable tool in our clustering evaluation arsenal.
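For readers who want the formula, this is how the index is usually written, following the original DBCV paper (Moulavi et al., 2014). Treat it as a summary rather than a full specification: the symbols follow that paper's notation, and the exact definitions of the two density terms are only paraphrased in the comments.

```latex
% DSC(C_i):       density sparseness of cluster C_i, i.e. the largest internal
%                 edge of its minimum spanning tree (mutual reachability distance)
% DSPC(C_i, C_j): density separation of C_i and C_j, i.e. the smallest mutual
%                 reachability distance between their internal points
% O:              the full dataset, l: the number of clusters

V_C(C_i) = \frac{\min_{j \neq i} DSPC(C_i, C_j) - DSC(C_i)}
                {\max\left(\min_{j \neq i} DSPC(C_i, C_j),\; DSC(C_i)\right)}

DBCV(C) = \sum_{i=1}^{l} \frac{|C_i|}{|O|} \, V_C(C_i)
```

The result lies between -1 and 1, and higher values indicate denser, better separated clusters.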

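In sklearn there isn’t a direct validation score for DBCV. A readily available alternative is the davies_bouldin_score, which also compares intra-cluster and inter-cluster structure. Keep in mind that it is not a substitute for DBCV: it is centroid-based (so, like the Silhouette score, it favors convex clusters), and lower values are better, whereas DBCV ranges from -1 to 1 and higher values are better.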

from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score
from sklearn.datasets import make_blobs

# Let's create a hypothetical dataset
X, _ = make_blobs(n_samples=500,
                  center_box=(-10.0, 10.0),
                  random_state=1)  # For reproducibility

# Specify the number of clusters and fit the data
kmeans = KMeans(n_clusters=4, random_state=1).fit(X)

# Retrieve the cluster labels assigned during fitting
labels = kmeans.labels_

# Compute the Davies-Bouldin score (lower is better)
db_score = davies_bouldin_score(X, labels)

print("The Davies-Bouldin score is:", db_score)
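To see what the Davies-Bouldin score actually measures, here is a minimal sketch that recomputes it by hand. For each cluster it takes the mean distance of the points to their centroid (the scatter), compares every pair of clusters via (scatter_i + scatter_j) / distance between centroids, keeps the worst ratio per cluster, and averages. The helper name manual_davies_bouldin and the explicit n_init=10 are choices made for this illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, pairwise_distances

# Same kind of hypothetical dataset as in the example above
X, _ = make_blobs(n_samples=500, center_box=(-10.0, 10.0), random_state=1)
labels = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(X)


def manual_davies_bouldin(X, labels):
    """Average, over clusters, of the worst scatter-to-separation ratio."""
    cluster_ids = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in cluster_ids])
    # Scatter: mean distance of each cluster's points to its own centroid
    scatter = np.array([
        np.linalg.norm(X[labels == c] - centroids[k], axis=1).mean()
        for k, c in enumerate(cluster_ids)
    ])
    # Separation: pairwise distances between centroids
    separation = pairwise_distances(centroids)
    np.fill_diagonal(separation, np.inf)  # ignore a cluster paired with itself
    ratios = (scatter[:, None] + scatter[None, :]) / separation
    return ratios.max(axis=1).mean()


manual_db = manual_davies_bouldin(X, labels)
sklearn_db = davies_bouldin_score(X, labels)
print("Manual Davies-Bouldin: ", manual_db)
print("sklearn Davies-Bouldin:", sklearn_db)
```

Everything here is expressed relative to centroids, which is why this score says little about ring-shaped or otherwise non-convex clusters.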

There is also this Python implementation of DBCV.

Practical Application and Comparison of Evaluation Methods

Now, let’s consider a practical example where DBCV outshines the Silhouette score, and discuss some pertinent implications.

Suppose we’re dealing with a dataset composed of customer reviews spanning a variety of products.

The reviews are diverse, resulting in clusters of different shapes and sizes. Our chosen method is KMeans clustering, favored for its simplicity and efficiency.

However, when we use the Silhouette score to evaluate our clustering, we are met with an unexpectedly high score. Sounds like excellent news, doesn’t it? But hold your horses!

Here’s the snag: the Silhouette score tends to prefer convex clusters, whilst our dataset comprises arbitrary-shaped clusters.

For example, consider the following:

source: https://math.stackexchange.com/a/4139343 (CC BY-SA)

In the above illustration, we can clearly see 3 different clusters — where each cluster is represented by its own concentric circle.

If we use something like KMeans, which like the Silhouette score favors convex clusters, we get the following result:

source: https://math.stackexchange.com/a/4139343 (CC BY-SA)

Consequently, despite the elevated score, the clustering outcome from KMeans doesn’t meet our expectations.

It’s an archetypal scenario of a misleading metric.
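The effect is easy to reproduce on synthetic data. The sketch below is not the code behind the plots above: it assumes a two-ring dataset from scikit-learn's make_circles as a simplified stand-in for the three rings, and the parameter values are illustrative. Because the data is synthetic, we do know the true ring of every point, so we can compare the Silhouette score of the KMeans labels with the Silhouette score of the true assignment.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Two concentric rings (a stand-in for the three-ring illustration)
X, y_true = make_circles(n_samples=500, factor=0.5, noise=0.05,
                         random_state=0)

# KMeans can only cut the plane into convex regions
kmeans_labels = KMeans(n_clusters=2, n_init=10,
                       random_state=0).fit_predict(X)

sil_kmeans = silhouette_score(X, kmeans_labels)  # score of the KMeans split
sil_rings = silhouette_score(X, y_true)          # score of the true rings
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)

print("Silhouette, KMeans labels:   ", sil_kmeans)
print("Silhouette, true ring labels:", sil_rings)
print("Adjusted Rand index, KMeans vs truth:", ari_kmeans)
```

On this kind of data, the Silhouette score should rate the KMeans split above the true rings, even though the adjusted Rand index shows that the KMeans labels barely agree with the real structure.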

Let’s change tack and employ DBCV for evaluation. Given its capability to handle arbitrary-shaped clusters, DBCV delivers a more precise evaluation of our KMeans clustering. It’s akin to obtaining a second opinion from a reliable source.

Using a technique that doesn’t directly favor convex clusters (Spectral clustering in this example), we get a much more realistic result:

source: https://math.stackexchange.com/a/4139343 (CC BY-SA)

Note. The above plots were retrieved from the public discussion available here. I strongly recommend that you go through this answer since it provides a really solid ground for understanding this concept better.
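For readers who want to experiment, here is a minimal sketch of the Spectral clustering side of the comparison. It is not the code from the linked discussion: it assumes a two-ring dataset from scikit-learn's make_circles, a nearest-neighbors affinity, and illustrative parameter values. The adjusted Rand index is only available here because the synthetic data comes with known ring labels.

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Two concentric rings with known membership
X, y_true = make_circles(n_samples=500, factor=0.5, noise=0.05,
                         random_state=0)

# A nearest-neighbors graph follows each ring instead of cutting across it
spectral = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                              n_neighbors=10, random_state=0)
spectral_labels = spectral.fit_predict(X)
kmeans_labels = KMeans(n_clusters=2, n_init=10,
                       random_state=0).fit_predict(X)

ari_spectral = adjusted_rand_score(y_true, spectral_labels)
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
sil_spectral = silhouette_score(X, spectral_labels)
sil_kmeans = silhouette_score(X, kmeans_labels)

print("ARI vs truth   spectral:", ari_spectral, " kmeans:", ari_kmeans)
print("Silhouette     spectral:", sil_spectral, " kmeans:", sil_kmeans)
```

Spectral clustering should recover the rings almost perfectly here, yet receive the lower Silhouette score of the two, which is exactly the mismatch between metric and cluster shape discussed above. scikit-learn may warn that the neighbor graph is not fully connected; that is expected when the two rings do not touch.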

DBCV’s advantages don’t end here, though.

Another useful characteristic, which it shares with the Silhouette score, is that it needs no ground truth labels.

In numerous real-world circumstances, we don’t have the privilege of possessing ground truth labels. For instance, within our customer reviews dataset, we lack prior knowledge of how the reviews ought to be grouped. In such situations, DBCV comes to our aid, offering a dependable evaluation of our clustering.

To wrap up, while the Silhouette score is a practical tool for evaluating clustering, it doesn’t provide a universal solution.

Depending on our dataset’s characteristics and our task’s specific requirements, other methods like DBCV may prove more suitable.

Concluding Remarks

We started this article by comprehending what clustering is and appreciating its significance. We recognized its role as an unsung hero of machine learning, assembling similar data points together in a plethora of applications, ranging from customer segmentation to anomaly detection.

Next, we embarked on understanding the ‘why’ behind clustering evaluation, recognizing its crucial role in appraising the quality of our clusters. We observed how using the incorrect evaluation metrics could steer us towards deceptive results.

We then examined two notable evaluation methods: the Silhouette score and Density-Based Clustering Validation (DBCV).

We learnt how the Silhouette score, despite its popularity and ease of interpretation, harbors a bias towards convex clusters.

Conversely, DBCV, with its inter/intra-cluster density calculation, emerged as a more adaptable instrument, particularly for clusters of arbitrary shape.

We briefly discussed a practical example where DBCV outperforms the Silhouette score.

However, if there’s one central lesson to take away from our expedition, it’s this: the significance of selecting the appropriate evaluation method that aligns with the specific traits of the clustering problem at hand.

Article originally posted here by David Farrugia.
