Visualization for Clustering Methods

Editor’s note: Evie Fowler is a speaker for ODSC West. Be sure to check out her talk, “Bridging the Interpretability Gap in Customer Segmentation,” there!

At this Fall’s Open Data Science Conference, I will talk about how to bring a systematic approach to the interpretation of clustering models. To get ready for that, let’s talk about data visualization for clustering models.

Preparing a Workspace

All of these visualizations can be created with the standard tools of data manipulation (pandas and numpy) and visualization (matplotlib and seaborn).

from matplotlib import colormaps, pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import MinMaxScaler
import numpy as np
import pandas as pd
import seaborn as sns

For this tutorial, I’ll use the diabetes dataset built into scikit-learn. I’ll offer a lot more insight on how to train and evaluate an effective clustering model at ODSC, but for now, I’ll just fit a few simple k-means models.

# load diabetes data
diabetesData = load_diabetes(as_frame = True).data

# center and scale clusterable features
diabetesScaler = MinMaxScaler().fit(diabetesData)
diabetesDataScaled = pd.DataFrame(diabetesScaler.transform(diabetesData)
                                  , columns = diabetesData.columns
                                  , index = diabetesData.index)

# build three small clustering models
km3 = KMeans(n_clusters = 3).fit(diabetesDataScaled)
km4 = KMeans(n_clusters = 4).fit(diabetesDataScaled)
km10 = KMeans(n_clusters = 10).fit(diabetesDataScaled)
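
One note on reproducibility: k-means starts from randomly chosen centroids, so cluster assignments can shift between runs. If exact reproducibility matters, scikit-learn’s random_state parameter pins the initialization; a minimal sketch:

# optional: fix the random seed so cluster assignments are reproducible
km3 = KMeans(n_clusters = 3, random_state = 0).fit(diabetesDataScaled)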

Choosing a Color Scheme

The matplotlib package provides a number of built-in color schemes through its colormaps registry. It is convenient to choose one colormap for the entirety of a visualization, and important to choose thoughtfully. That can mean evaluating everything from whether the map is sequential (for data interpreted along a scale from low to high) or diverging (for data most relevant at either of two extremes) to whether it is thematically appropriate for the subject (greens and browns for a topography project). When there is no particular relationship between the data and the order in which it will be presented, the nipy_spectral colormap is a good choice.

# choose the nipy_spectral colormap from matplotlib
nps = colormaps['nipy_spectral']

# view the whole colormap
nps
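
For comparison, sequential and diverging maps can be pulled from the same registry; for example:

# other colormap families available in the registry
seqMap = colormaps['viridis']     # sequential: reads from low to high
divMap = colormaps['coolwarm']    # diverging: emphasizes both extremes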

Each matplotlib colormap consists of a series of tuples, each describing a color in RGBA format (though with components scaled to [0, 1] rather than [0, 255]). Individual colors from the map can be accessed either by integer (between 0 and 255) or by float (between 0 and 1). Numbers close to 0 correspond to colors at the lower end of the colormap, while integers close to 255 and floats close to 1.0 correspond to colors at the upper end. Intuitively, the same color can be referenced either by an integer or by the float equal to that integer divided by 255.

# view select colors from the colormap
print(nps(51))
#(0.0, 0.0, 0.8667, 1.0)

print(nps(0.2))
#(0.0, 0.0, 0.8667, 1.0)
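
Since 51 / 255 = 0.2, the two calls above resolve to the same color, which can be verified directly:

# confirm that the integer and float lookups agree
print(nps(51) == nps(51 / 255))
# True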

Creating Visualizations

Scatter Plots

The classic visualization for a clustering model is a series of scatter plots comparing each pair of features that went into the clustering model, with cluster assignment denoted by color. There are built-in methods to achieve this, but a DIY approach gives more control over details like the color scheme.
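
For reference, one such built-in method is seaborn’s pairplot; a minimal sketch, with the cluster labels added as a column (the pairData name is just for illustration):

# built-in alternative: seaborn's pairplot, colored by cluster label
pairData = diabetesDataScaled.assign(cluster = km3.labels_)
sns.pairplot(pairData, hue = 'cluster', palette = 'nipy_spectral', plot_kws = {'s': 5})
plt.show()

The DIY version below builds the same grid by hand, which makes it easier to control the color scheme and axis labels.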

def plotScatters(df, model):
    """ Create scatter plots based on each pair of columns in a dataframe.
    Use color to denote model label.
    """

    # create a figure and axes
    plotRows = df.shape[1]
    plotCols = df.shape[1]
    fig, axes = plt.subplots(
        # create one row and one column for each feature in the dataframe
        plotRows, plotCols
        # scale up the figure size for easy viewing
        , figsize = ((plotCols * 3), (plotRows * 3))
    )   
    # iterate through subplots to create scatter plots
    pltindex = 0
    for i in np.arange(0, plotRows):
        for j in np.arange(0, plotCols):
            pltindex += 1
            # identify the current subplot
            plt.subplot(plotRows, plotCols, pltindex)
            plt.scatter(
                # compare the i-th and j-th features of the dataframe
                df.iloc[:, j], df.iloc[:, i]
                # use integer cluster labels and a color map to unify color selection
                , c = model.labels_, cmap = nps
                # choose a small marker size to reduce overlap
                , s = 1)
            # label the x axis on the bottom row of sub plots
            if i == df.shape[1] - 1:
                plt.xlabel(df.columns[j])
            # label the y axis on the first column of sub plots
            if j == 0:
                plt.ylabel(df.columns[i])           

    plt.show()

These plots do double duty, showing the relationship between a pair of features and the relationship between those features and cluster assignment.

plotScatters(diabetesDataScaled, km3)

As analysis progresses, it’s easy to focus on a smaller subset of features.

plotScatters(diabetesDataScaled.iloc[:, 2:7], km4)

Violin Plots

To get a better sense of the distribution of each feature within each cluster, we can also look at violin plots. If you’re not familiar with violin plots, think of them as the grown-up cousin of the classic box plot. Where box plots identify only a few key descriptors of a distribution, violin plots are contoured to illustrate the entire probability density function.
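
As a quick illustration of the difference, the two plot types can be drawn side by side for a single feature (here bmi, grouped by the km3 labels):

# compare a box plot and a violin plot of the same feature
fig, (axBox, axViolin) = plt.subplots(1, 2, figsize = (8, 4))
sns.boxplot(x = km3.labels_, y = diabetesDataScaled['bmi'], ax = axBox)
sns.violinplot(x = km3.labels_, y = diabetesDataScaled['bmi'], ax = axViolin)
plt.show()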

def plotViolins(df, model, plotCols = 5):
    """ Create violin plots of each feature in a dataframe
    Use model labels to group.
    """  

    # calculate number of rows needed for plot grid
    plotRows = df.shape[1] // plotCols
    while plotRows * plotCols < df.shape[1]:
        plotRows += 1      

    # create a figure and axes
    fig, axes = plt.subplots(plotRows, plotCols
                             # scale up the figure size for easy viewing
                             , figsize = ((plotCols * 3), (plotRows * 3))
                            )  

    # identify unique cluster labels from model
    uniqueLabels = sorted(np.unique(model.labels_))    

    # create a custom subpalette from the unique labels
    # this returns one RGBA color per cluster, evenly spaced across the colormap
    npsTemp = nps([x / max(uniqueLabels) for x in uniqueLabels])  

    # add integer cluster labels to input dataframe
    df2 = df.assign(cluster = model.labels_)  

    # iterate through subplots to create violin plots
    pltindex = 0
    for col in df.columns:
        pltindex += 1
        plt.subplot(plotRows, plotCols, pltindex)
        sns.violinplot(
            data = df2
            # use cluster labels as x grouper
            , x = 'cluster'
            # use current feature as y values
            , y = col
            # use cluster labels and custom palette to unify color selection
            , hue = model.labels_
            , palette = npsTemp
        ).legend_.remove()
        # label y axis with feature name
        plt.ylabel(col)   

    plt.show()

plotViolins(diabetesDataScaled, km3, plotCols = 5)

Histograms

Violin plots show the distribution of each feature within each cluster, but it is also helpful to look at how each cluster is represented in the broader distribution of each feature. A modified histogram can illustrate this well.

def histogramByCluster(df, labels, plotCols = 5, nbins = 30, legend = False, vlines = False):
    """ Create a histogram of each feature.
    Use model labels to color code.
    """
 
    # calculate number of rows needed for plot grid
    plotRows = df.shape[1] // plotCols
    while plotRows * plotCols < df.shape[1]:
        plotRows += 1

    # identify unique cluster labels
    uniqueLabels = sorted(np.unique(labels))
  
    # create a figure and axes
    fig, axes = plt.subplots(plotRows, plotCols
                             # scale up the figure size for easy viewing
                             , figsize = ((plotCols * 3), (plotRows * 3))
                            )
    pltindex = 0
    # loop through features in input data
    for col in df.columns:
        # discretize the feature into specified number of bins
        tempBins = np.trunc(nbins * df[col]) / nbins
        # cross the discretized feature with cluster labels
        tempComb = pd.crosstab(tempBins, labels)
        # create an index in the same size as the cross tab
        # this will help with alignment
        ind = np.arange(tempComb.shape[0])

        # identify the relevant subplot
        pltindex += 1
        plt.subplot(plotRows, plotCols, pltindex)
        # create grouped histogram data
        histPrep = {}
        # work one cluster at a time
        for lbl in uniqueLabels:
            histPrep.update(
                {
                    # associate the cluster label...
                    lbl:
                    # ... with a bar chart
                    plt.bar(
                        # use the feature-specific index to set x locations
                        x = ind
                        # use the counts associated with this cluster as bar height
                        , height = tempComb[lbl]
                        # stack this bar on top of previous cluster bars
                        , bottom = tempComb[[x for x in uniqueLabels if x < lbl]].sum(axis = 1)
                        # eliminate gaps between bars
                        , width = 1
                        , color = nps(lbl / max(uniqueLabels))
                    )
                }
            )
       
        # use feature name to label x axis of each plot
        plt.xlabel(col)
    
        # label the y axis of plots in the first column
        if pltindex % plotCols == 1:
            plt.ylabel('Frequency')
        plt.xticks(ind[0::5], np.round(tempComb.index[0::5], 2))
     
        # if desired, overlay vertical lines
        if vlines:
            for vline in vlines:
                plt.axvline(x = vline * ind[-1], lw = 0.5, color = 'red')
    
    if legend:
        leg1 = []; leg2 = []
        for key in histPrep:
            leg1 += [histPrep[key]]
            leg2 += [str(key)]
        plt.legend(leg1, leg2)

    plt.show()

histogramByCluster(diabetesDataScaled, km4.labels_)

This process scales easily when more cluster categories are needed.

histogramByCluster(diabetesDataScaled, km10.labels_)
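
The optional legend and vlines arguments overlay a color key and vertical reference lines (expressed as fractions of the binned range); for example:

histogramByCluster(diabetesDataScaled, km4.labels_, legend = True, vlines = [0.25, 0.75])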

Conclusion

These visualizations will provide a strong base for evaluating clustering models. For more about how to do so in a systematic way, be sure to come to my talk at this Fall’s Open Data Science Conference in San Francisco!

About the Author:

Evie Fowler is a data scientist based in Pittsburgh, Pennsylvania. She currently works in the healthcare sector leading a team of data scientists who develop predictive models centered on the patient care experience. She holds a particular interest in the ethical application of predictive analytics and in exploring how qualitative methods can inform data science work. She holds an undergraduate degree from Brown University and a master’s degree from Carnegie Mellon.
