Traditional Clustering Approaches
In machine learning, the most traditional and popular methods of clustering are hierarchical clustering (similarity-based clustering) and k-means clustering (feature-based clustering). Hierarchical clustering, put simply, is grouping together points in a vector space that are closest in distance from each other. Pseudo-code for hierarchical clustering would look something like this:
1. Calculate the distance between all objects. Store the results in a distance matrix.
2. Search through the distance matrix and find the two most similar clusters/objects.
3. Join the two clusters/objects to produce a cluster that now has at least two objects.
4. Update the matrix by calculating the distances between this new cluster and all other clusters.
5. Repeat step 2 until all cases are in one cluster.
Hierarchical clustering works great on small datasets. A major advantage of this method is the user does not need to know anything about the dataset in advance and specify any hyper-parameters (like number of clusters). But, once the datasets become significantly large, due to computational difficulties, hierarchical clustering becomes slow and does not show favorable results.
K-means clustering algorithm is faster when compared to hierarchical clustering methods. Pseudo-code for k-means clustering would look something like this:
1. User inputs a value for k (number of clusters).
2. Initialize the k cluster centers (randomly, if necessary).
3. Decide the class memberships of the N objects by assigning them to the nearest cluster center.
4. Re-estimate the k cluster centers, by assuming the memberships found above are correct.
5. If none of the N objects changed membership in the last iteration, exit. Otherwise, reiterate from step 3.
As you can see, our first step to this approach is to define the number of clusters. As a data scientist, I will have to iterate the number of clusters each time, visualize the results obtained each time, and continue to iterate the number of clusters until I am satisfied. Both of these approaches are well-researched and popular, and can produce acceptable results in a variety of situations.
Where Traditional Approaches Break Down
Unfortunately, traditional clustering approaches begin to break down when asked to handle certain data types—most notably, time series data. Time series data has a unique characteristic where the current data point depends on a specific number of previous data points, also known as a time window.
One problem that has been especially difficult to solve is time series anomaly detection: identifying interesting behavioral patterns in the underlying dataset. Take an oil and gas combustion turbine, for example. Based on historical data from the last three years, we can generate a machine learning model capable of identifying unique behavioral patterns (clusters) among the asset’s different operating modes. Using this model, anomalies can be detected if new incoming data points do not fall into any of the previously obtained clusters. These anomalies can then be flagged for a SME to act upon, facilitating maintenance efforts. Seems simple, right? There is one problem.
This paper posits that using traditional clustering approaches on time series data appear to be meaningless, in that the cluster centers/boundaries that were obtained for two time series data sets (random walk data and stock market data) appear to be astonishingly similar. This paper, in essence, has shown that clustering time series data will result in meaningless clusters, as they look eerily similar to results on clustering random noise.
A Better Way Forward Using Autoencoders
To overcome these shortcomings brought about by time series data, there is an approach known as Gaussian Mixture Model Variational Autoencoders (GMMVAE) which is a clustering approach within the framework of auto-encoders.
Think of autoencoders as a neural network-based approach towards dimensionality reduction. Autoencoders take a set of inputs X1, X2,…Xn, encode them to a lower-dimensional latent space, and then decode them back to represent the inputs as their outputs.Just like any other neural network, autoencoders have hidden layers and activation functions in their encoding/decoding process. Backpropagation is done to reduce the output loss by iterating the weights through the neural network, until the loss is small enough.
Gaussian Mixture Model Variational Autoencoders (GMMVAE)
Using autoencoders, Xie et. al. proposed deep embedded clustering (DEC) where input data was reduced to a lower dimensional feature space (like a PCA approach) and unsupervised algorithms were applied on this feature space. Even though they gave promising results over traditional clustering methods, this is an unsatisfactory approach, as the assumptions underlying the dimensionality reduction techniques are generally independent of the assumptions of the clustering techniques. To overcome these problems, a GMMVAE was proposed. Pseudo-code for GMMVAE would look something like this:
1. Initialize random weights for a neural network (N)
2. Feed input data into the neural network. Perform encoding of input data
3. Generate latent vectors that follow a Gaussian unit distribution.
4. Feed these latent vectors into a decoding network to generate samples of the input
5. Use re-parametrization trick to reduce loss
6. Perform density-based clustering based on different Gaussians that were formed in the latent vector space
Sample results using GMMVAE that was performed on a synthetic dataset are displayed below.
Traditional clustering approaches vs DEC
GMMVAE vs DEC
Analyzing the results of this approach (GMMVAE and DEC results on MNIST dataset are shown below), we see that GMMVAEs perform better than DEC methods also and demonstrate a great jump in clustering accuracy when compared to traditional clustering techniques.
Applications of Unsupervised Clustering Approaches at SparkCognition
Returning to practical applications, the SparkPredict is often deployed in the energy space to monitor crucial equipment. A more effective clustering algorithm would hold great value for a subject matter expert, as anomalies could be detected from a live stream of sensor data, and acted upon immediately if necessary.
It is very interesting to see the amount of research that is happening in the field of unsupervised learning. With the success and ever-growing research in the field of deep learning, neural networks can be tweaked to remember time series sequences also in this clustering approach. Using recurrent neural networks (RNNs) will be the obvious first improvement to explore. [KM6] It will be interesting to see how cluster accuracy improves when using RNNs, LSTMs GRUs etc. with these deep neural networks.
1. Deep Unsupervised Clustering with Gaussian Mixture Variational Autoencoders, Nat Dilokthanakul1,∗ , Pedro A. M. Mediano1 , Marta Garnelo1 , Matthew C. H. Lee1 , Hugh Salimbeni1 , Kai Arulkumaran2 & Murray Shanahan1 1Department of Computing, 2Department of Bioengineering Imperial College London London, UK (https://arxiv.org/pdf/1611.02648.pdf)
2. Unsupervised Deep Embedding for Clustering Analysis – Xie et al; a. arXiv:1511.06335 [cs.LG] 3. https://medium.com/@llionj/the-reparameterization-trick-4ff30fe92954
5. Variational Deep Embedding: An Unsupervised and Generative Approach to Clustering ∗ Zhuxi Jiang 1 , Yin Zheng 2 , Huachun Tan 1 , Bangsheng Tang 3 , Hanning Zhou 3 1Beijing Institute of Technology, Beijing, China 2Tencent AI Lab, Shenzhen, China 3Hulu LLC., Beijing, China (https://arxiv.org/pdf/1611.05148.pdf)