K-means clustering is a partitioning approach for unsupervised statistical learning. It is somewhat unlike agglomerative approaches like hierarchical clustering. A partitioning approach starts with all data points and tries to divide them into a fixed number of clusters.
K-means is applied to a set of quantitative variables. We fix the number of clusters in advance and must guess where the centers (called “centroids”) of those clusters are. We work to assign points to the closest centroid and then recalculate the centroids to iterate through this clustering approach.
When we speak about the accuracy of a statistical learning algorithm, we refer to a measure of comparing the true label to the predicted label. The K-Means clustering algorithm works on “unlabeled” data sets where a predicted label does not exist. This means K-Means clustering evaluation cannot directly apply accuracy as supervised methods can. There are however, some measurements that you can use to evaluate clusters.
Within Cluster Sum of Squares
One measurement is Within Cluster Sum of Squares (WCSS), which measures the squared average distance of all the points within a cluster to the cluster centroid. To calculate WCSS, you first find the Euclidean distance (see figure below) between a given point and the centroid to which it is assigned. You then iterate this process for all points in the cluster, and then sum the values for the cluster and divide by the number of points. Finally, you calculate the average across all clusters. This will give you the average WCSS.
Essentially, WCSS measures the variability of the observations within each cluster. In general, a cluster that has a small sum of squares is more compact than a cluster that has a large sum of squares. Clusters that have higher values exhibit greater variability of the observations within the cluster.
From another perspective, WCSS is influenced by the number of observations. As the number of observations increases, you’ll notice that the sum of squares increases. This means that WCSS is often not directly comparable across clusters with different numbers of observations. To compare within-cluster variabilities of different clusters, you should use the average distance from centroid.
Between Clusters Sum of Squares
Another measurement is Between Clusters Sum of Squares (BCSS), which measures the squared average distance between all centroids. To calculate BCSS, you find the Euclidean distance from a given cluster centroid to all other cluster centroids. You then iterate this process for all of the clusters, and sum all of the values together. This value is the BCSS. You can divide by the number of clusters to calculate the average BCSS.
Essentially, BCSS measures the variation between all clusters. A large value can indicate clusters that are spread out, while a small value can indicate clusters that are close to each other.
Other Cluster Metrics
There are a number of other metrics for K-means clustering that can help you hone your use of this unsupervised learning method. Here is a short list to investigate:
- To get help choosing the optimal k value for K-means clustering, you can use the Elbow Method.
- Silhouette Coefficient – you can use the Scikit-learn function silhouette_score to compute the mean Silhouette Coefficient of all samples.
- The Gap Statistic Method compares the total within intra-cluster variation for different values of k with their expected values under null reference distribution of the data, i.e. a distribution with no obvious clustering.
- Here is a nice tutorial on K-means clustering (including a mathematical foundation) using R code examples, along with the use of WCSS and BCSS.