# Visualization for Clustering Methods

Posted by ODSC Community on August 28, 2023. Topics: Data Visualization, Modeling, West 2023.

*Editor’s note: Evie Fowler is a speaker for ODSC West. Be sure to check out her talk, “Bridging the Interpretability Gap in Customer Segmentation,” there!*

At this fall’s Open Data Science Conference, I will talk about how to bring a systematic approach to the interpretation of clustering models. To get ready for that, let’s talk about data visualization for clustering models.

## Preparing a Workspace

All of these visualizations can be created with the basic tools of data manipulation (pandas and numpy) and the basics of visualization (matplotlib and seaborn). The dataset and the clustering models themselves come from scikit-learn.

```python
from matplotlib import colormaps, pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import MinMaxScaler
import numpy as np
import pandas as pd
import seaborn as sns
```

For this tutorial, I’ll use the diabetes prediction dataset built into scikit-learn. I’ll offer a lot more insight on how to train and evaluate an effective clustering model at ODSC, but for now, I’ll just fit a few simple k-means models.

```python
# load diabetes data
diabetesData = load_diabetes(as_frame = True).data

# rescale clusterable features to the [0, 1] range
diabetesScaler = MinMaxScaler().fit(diabetesData)
diabetesDataScaled = pd.DataFrame(diabetesScaler.transform(diabetesData)
    , columns = diabetesData.columns
    , index = diabetesData.index)

# build three small clustering models
km3 = KMeans(n_clusters = 3).fit(diabetesDataScaled)
km4 = KMeans(n_clusters = 4).fit(diabetesDataScaled)
km10 = KMeans(n_clusters = 10).fit(diabetesDataScaled)
```
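Before plotting anything, it can help to confirm that the scaling behaved as expected and to see how many observations each cluster received. The block below is a minimal, self-contained sketch of that check. The names `scaled`, `model`, and `clusterSizes` are illustrative, and the `n_init` and `random_state` arguments are assumptions added here so the labels are repeatable; they are not part of the models fit above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import MinMaxScaler
import numpy as np
import pandas as pd

# rebuild the scaled features so this block runs on its own
features = load_diabetes(as_frame = True).data
scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(features)
    , columns = features.columns
    , index = features.index
)

# min-max scaling should place every feature on the [0, 1] interval
print(scaled.agg(['min', 'max']).round(2))

# n_init and random_state are assumptions added for repeatable labels
model = KMeans(n_clusters = 3, n_init = 10, random_state = 0).fit(scaled)

# count how many observations landed in each cluster
clusterSizes = pd.Series(model.labels_).value_counts().sort_index()
print(clusterSizes)
```

A very small cluster in this count is worth knowing about up front, since it will be hard to see in any of the plots that follow.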

## Choosing a Color Scheme

The matplotlib package provides a number of built-in color schemes through its colormaps registry. It is convenient to choose one colormap for the entirety of a visualization, and important to choose thoughtfully. That can mean evaluating everything from whether the map is sequential (for when data can be interpreted along a scale from low to high) or diverging (for when data is most relevant at either of two extremes) to whether it is thematically appropriate for the subject (greens and browns for a topography project). When there is no particular relationship between the data and the order it will be presented in, as with arbitrary cluster labels, the nipy_spectral colormap is a good choice.

```python
# choose the nipy_spectral colormap from matplotlib
nps = colormaps['nipy_spectral']

# view the whole colormap
nps
```

The nipy_spectral colormap, like matplotlib’s other continuous colormaps, works by default as a lookup table of 256 colors, each described by a tuple in RGBA format (though with components scaled to [0, 1] rather than [0, 255]). Individual colors from the map can be accessed either by integer (between 0 and 255) or by float (between 0 and 1). Numbers close to 0 correspond to colors at the lower end of the color map, while integers close to 255 and floats close to 1.0 correspond to colors at the upper end of the color map. Intuitively, the same color can be described by either an integer or a float representing that integer divided by 255.

```python
# view select colors from the colormap
print(nps(51))
#(0.0, 0.0, 0.8667, 1.0)
print(nps(0.2))
#(0.0, 0.0, 0.8667, 1.0)
```
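The integer and float lookups described above can be checked directly, and the same mechanism gives one evenly spaced color per cluster. This is a small sketch rather than part of the original workflow; `nClusters` and `clusterColors` are illustrative names.

```python
from matplotlib import colormaps
import numpy as np

nps = colormaps['nipy_spectral']

# number of entries in the colormap's lookup table
print(nps.N)

# an integer index and the matching float select the same entry
print(nps(51) == nps(51 / 255))

# sample one evenly spaced color per cluster for a 4-cluster model
nClusters = 4
clusterColors = nps(np.linspace(0, 1, nClusters))

# calling the colormap on an array returns one RGBA row per value
print(clusterColors.shape)
```

Dividing each cluster label by the largest label, as the plotting functions later in this post do, produces the same evenly spaced floats as `np.linspace(0, 1, nClusters)` when labels run from 0 to `nClusters - 1`.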

## Creating Visualizations

### Scatter Plots

The classic visualization for a clustering model is a series of scatter plots comparing each pair of features that went into the clustering model, with cluster assignment denoted by color. There are built-in methods to achieve this, but a DIY approach gives more control over details like the color scheme.

```python
def plotScatters(df, model):
    """
    Create scatter plots based on each pair of columns in a dataframe.
    Use color to denote model label.
    """
    # create a figure and axes
    plotRows = df.shape[1]
    plotCols = df.shape[1]
    fig, axes = plt.subplots(
        # create one row and one column for each feature in the dataframe
        plotRows, plotCols
        # scale up the figure size for easy viewing
        , figsize = ((plotCols * 3), (plotRows * 3))
    )
    # iterate through subplots to create scatter plots
    pltindex = 0
    for i in np.arange(0, plotRows):
        for j in np.arange(0, plotCols):
            pltindex += 1
            # identify the current subplot
            plt.subplot(plotRows, plotCols, pltindex)
            plt.scatter(
                # compare the i-th and j-th features of the dataframe
                df.iloc[:, j], df.iloc[:, i]
                # use integer cluster labels and a color map to unify color selection
                , c = model.labels_, cmap = nps
                # choose a small marker size to reduce overlap
                , s = 1)
            # label the x axis on the bottom row of sub plots
            if i == df.shape[1] - 1:
                plt.xlabel(df.columns[j])
            # label the y axis on the first column of sub plots
            if j == 0:
                plt.ylabel(df.columns[i])
    plt.show()
```

These plots do double duty, showing the relationship between a pair of features and the relationship between those features and cluster assignment.

```python
plotScatters(diabetesDataScaled, km3)
```

As analysis progresses, it’s easy to focus on a smaller subset of features.

```python
plotScatters(diabetesDataScaled.iloc[:, 2:7], km4)
```

### Violin Plots

To get a better sense of the distribution of each feature within each cluster, we can also look at violin plots. If you’re not familiar with violin plots, think of them as the grown-up cousin of the classic box plot. Where box plots identify only a few key descriptors of a distribution, violin plots are contoured to illustrate the entire estimated probability density function.

```python
def plotViolins(df, model, plotCols = 5):
    """
    Create violin plots of each feature in a dataframe.
    Use model labels to group.
    """
    # calculate number of rows needed for plot grid
    plotRows = df.shape[1] // plotCols
    while plotRows * plotCols < df.shape[1]:
        plotRows += 1
    # create a figure and axes
    fig, axes = plt.subplots(plotRows, plotCols
        # scale up the figure size for easy viewing
        , figsize = ((plotCols * 3), (plotRows * 3))
    )
    # identify unique cluster labels from model
    uniqueLabels = sorted(np.unique(model.labels_))
    # create a custom subpalette from the unique labels
    # calling the colormap on a list returns one RGBA row per label
    npsTemp = nps([x / max(uniqueLabels) for x in uniqueLabels])
    # add integer cluster labels to input dataframe
    df2 = df.assign(cluster = model.labels_)
    # iterate through subplots to create violin plots
    pltindex = 0
    for col in df.columns:
        pltindex += 1
        plt.subplot(plotRows, plotCols, pltindex)
        sns.violinplot(
            data = df2
            # use cluster labels as x grouper
            , x = 'cluster'
            # use current feature as y values
            , y = col
            # use cluster labels and custom palette to unify color selection
            , hue = model.labels_
            , palette = npsTemp
        ).legend_.remove()
        # label y axis with feature name
        plt.ylabel(col)
    plt.show()

plotViolins(diabetesDataScaled, km3, plotCols = 5)
```
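To see why the contour matters, consider a feature with two separate groups of values. The sketch below uses synthetic data (not the diabetes dataset) and scipy’s `gaussian_kde` as a stand-in for the density estimate a violin plot draws; seaborn’s own estimator may differ in its details. The quartiles a box plot would show sit in plausible places, while the density reveals that almost no observations lie near the median.

```python
from scipy.stats import gaussian_kde
import numpy as np

rng = np.random.default_rng(0)

# synthetic bimodal feature: two tight groups centered at 0.25 and 0.75
sample = np.concatenate([
    rng.normal(0.25, 0.05, 500)
    , rng.normal(0.75, 0.05, 500)
])

# a box plot is drawn from summary statistics like these quartiles
quartiles = np.percentile(sample, [25, 50, 75])
print(quartiles.round(2))

# a violin plot is drawn from a kernel density estimate instead
density = gaussian_kde(sample)
atModes = density([0.25, 0.75])
atMiddle = density([0.5])

# the estimated density is high at the two modes and low between them
print(atModes.round(2), atMiddle.round(2))
```

In a clustering context, a violin with two bulges inside a single cluster is a hint that the cluster may be combining two distinct groups.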

### Histograms

Violin plots show the distribution of each feature within each cluster, but it is also helpful to look at how each cluster is represented in the broader distribution of each feature. A modified histogram can illustrate this well.

```python
def histogramByCluster(df, labels, plotCols = 5, nbins = 30, legend = False, vlines = False):
    """
    Create a histogram of each feature.
    Use model labels to color code.
    """
    # calculate number of rows needed for plot grid
    plotRows = df.shape[1] // plotCols
    while plotRows * plotCols < df.shape[1]:
        plotRows += 1
    # identify unique cluster labels
    uniqueLabels = sorted(np.unique(labels))
    # create a figure and axes
    fig, axes = plt.subplots(plotRows, plotCols
        # scale up the figure size for easy viewing
        , figsize = ((plotCols * 3), (plotRows * 3))
    )
    pltindex = 0
    # loop through features in input data
    for col in df.columns:
        # discretize the feature into specified number of bins
        tempBins = np.trunc(nbins * df[col]) / nbins
        # cross the discretized feature with cluster labels
        tempComb = pd.crosstab(tempBins, labels)
        # create an index in the same size as the cross tab
        # this will help with alignment
        ind = np.arange(tempComb.shape[0])
        # identify the relevant subplot
        pltindex += 1
        plt.subplot(plotRows, plotCols, pltindex)
        # create grouped histogram data
        histPrep = {}
        # work one cluster at a time
        for lbl in uniqueLabels:
            histPrep.update(
                {
                    # associate the cluster label...
                    lbl:
                    # ... with a bar chart
                    plt.bar(
                        # use the feature-specific index to set x locations
                        x = ind
                        # use the counts associated with this cluster as bar height
                        , height = tempComb[lbl]
                        # stack this bar on top of previous cluster bars
                        , bottom = tempComb[[x for x in uniqueLabels if x < lbl]].sum(axis = 1)
                        # eliminate gaps between bars
                        , width = 1
                        , color = nps(lbl / max(uniqueLabels))
                    )
                }
            )
        # use feature name to label x axis of each plot
        plt.xlabel(col)
        # label the y axis of plots in the first column
        if pltindex % plotCols == 1:
            plt.ylabel('Frequency')
        plt.xticks(ind[0::5], np.round(tempComb.index[0::5], 2))
        # if desired, overlay vertical lines
        if vlines:
            for vline in vlines:
                plt.axvline(x = vline * ind[-1], lw = 0.5, color = 'red')
        if legend:
            leg1 = []; leg2 = []
            for key in histPrep:
                leg1 += [histPrep[key]]
                leg2 += [str(key)]
            plt.legend(leg1, leg2)
    plt.show()

histogramByCluster(diabetesDataScaled, km4.labels_)
```
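The binning and stacking steps inside `histogramByCluster` are easier to follow on a handful of values. The sketch below uses a toy feature and made-up labels (both hypothetical) and reproduces the three moves: truncating each value to the left edge of its bin, cross-tabulating bins against labels, and summing the counts of lower-numbered clusters to find where each bar should start.

```python
import numpy as np
import pandas as pd

# toy feature on the [0, 1] scale with made-up cluster labels
values = pd.Series([0.05, 0.15, 0.18, 0.55, 0.60, 0.95], name = 'feature')
labels = np.array([0, 0, 1, 1, 2, 2])
nbins = 4

# truncate each value down to the left edge of its bin
binned = np.trunc(nbins * values) / nbins
print(binned.tolist())

# rows are bins, columns are clusters, cells are counts
counts = pd.crosstab(binned, labels)
print(counts)

# each cluster's bars start where the lower-numbered clusters end
uniqueLabels = sorted(np.unique(labels))
bottoms = {}
for lbl in uniqueLabels:
    bottoms[lbl] = counts[[x for x in uniqueLabels if x < lbl]].sum(axis = 1)
    print(lbl, bottoms[lbl].tolist(), counts[lbl].tolist())
```

Notice that the bin starting at 0.25 has no observations and therefore no row in the cross tab. Because the function positions bars by row number rather than by bin value, an empty bin closes up instead of leaving a gap, which is worth keeping in mind when reading the x axis of sparse features.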

This process scales easily when more cluster categories are needed.

```python
histogramByCluster(diabetesDataScaled, km10.labels_)
```

## Conclusion

These visualizations will provide a strong base for evaluating clustering models. For more about how to do so in a systematic way, be sure to come to my talk at this fall’s Open Data Science Conference in San Francisco!

## About the Author:

Evie Fowler is a data scientist based in Pittsburgh, Pennsylvania. She currently works in the healthcare sector leading a team of data scientists who develop predictive models centered on the patient care experience. She holds a particular interest in the ethical application of predictive analytics and in exploring how qualitative methods can inform data science work. She holds an undergraduate degree from Brown University and a master’s degree from Carnegie Mellon.