Intro to Data Mining, K-means and Hierarchical Clustering
Posted by Anshu Gandotra, September 14, 2017
Introduction
In this article, I will discuss what data mining is and why we need it. We will then look at a type of data mining called clustering and go over two clustering algorithms, K-means and hierarchical clustering, and how they solve data mining problems.
Table of Contents
 What is data mining? Why is it needed?
 Five steps involved in data mining
 Data mining tasks
 What is clustering?
 Two types of clustering algorithms
 K-means Clustering Algorithm
 Elbow Method to determine the optimal number of clusters
 Hierarchical Clustering
What is data mining and why do we need it?
Relevant data is difficult to find and understand because data is collected and stored over networks at a massive rate. For example, credit card transactions, demographic data, and web server logs can occupy terabytes of storage. Data mining is a technique for analyzing and exploring large amounts of data to uncover meaningful insights. It helps in understanding, sorting, and selecting relevant information, uncovers hidden value in databases, and transforms data into useful information.
Five steps involved in the data mining process
Now, let’s delve into the five main steps of the data mining process. They are shown in Figure 1 and explained below.
Figure 1: Data mining steps
Step 1: Selection: In the selection step, data is collected and integrated from a variety of sources. We collect only the data that will help us gain meaningful insight.
Step 2: Preprocessing: The collected data may not be clean, so it needs preprocessing. This step consists of handling missing values and inconsistent data, and we apply various techniques to remove such anomalies.
Step 3: Transformation: After preprocessing, we need to transform the data into forms appropriate for mining. This includes aggregation, normalization, etc.
Step 4: Data mining: Now we can apply data mining techniques to the data to uncover insights. There are various data mining techniques, such as classification, clustering, and association.
Step 5: Evaluation: This step involves evaluating the generated patterns, for example by visualizing them and removing those that are redundant.
Next, let’s look at the two main data mining tasks and see which category clustering falls into.
Data mining tasks
Figure 2: Data mining tasks
The two main data mining tasks are:
 Predictive methods: These methods use some variables to predict unknown values of other variables. They include data mining tasks such as classification.
 Descriptive methods: These methods find patterns that describe the data, so that new and important information can be derived from the data available. The descriptive methods include clustering and association.
What is Clustering?
Clustering is a type of unsupervised learning, which means that it finds hidden structure in unlabeled data. It is a method used to group similar objects (close in terms of distance) together in the same group, called a cluster [1]. The data objects in a cluster are similar to one another and different from the objects in other clusters. Clustering uses the similarity between features to form these groups, and Euclidean distance is commonly used to measure the similarity of two objects.
Now let’s look at Figure 3 below. There are three clusters: cluster 1, cluster 2 and cluster 3. The main point is that similar objects (close in terms of distance) are grouped together in the same cluster, such that intra-cluster distances are minimized and inter-cluster distances are maximized.
Figure 3: Clustering
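As a small illustration of the distance measure mentioned above, the sketch below computes the Euclidean distance between two points, first directly from the definition and then with base R's dist(). The points a and b are made-up values chosen only for this example.

```r
# Two hypothetical observations with two features each (illustrative values)
a <- c(1, 2)
b <- c(4, 6)

# Euclidean distance computed directly from the definition
d_manual <- sqrt(sum((a - b)^2))

# The same distance using base R's dist(), whose default method is Euclidean
d_builtin <- as.numeric(dist(rbind(a, b)))

print(d_manual)   # 5
print(d_builtin)  # 5
```

The smaller this distance, the more similar two objects are considered to be when forming clusters.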
Two types of clustering algorithms
The two main clustering algorithms are:
 K-means clustering
 Hierarchical clustering
Let’s first understand how the K-means clustering algorithm works. I will then demonstrate K-means clustering in R using the iris dataset from http://archive.ics.uci.edu/ml/datasets/Iris?ref=datanews.io
 K-means Clustering:
It uses the partitioning approach. The main goal of this algorithm is to partition n objects into k clusters and then evaluate the partition by some criterion, such as minimizing the within-cluster sum of squares (WCSS). Let x denote an observation and μ the mean of the cluster it belongs to.
Then WCSS is the sum, over all clusters, of the squared Euclidean distances between each observation x and its cluster mean μ.
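The original formula image is not available here, so the following is the standard way of writing WCSS rather than a copy of the author's figure. It assumes some notation that does not appear in the text: S_i is the set of observations assigned to cluster i, and μ_i is the mean of the observations in S_i.

```latex
\mathrm{WCSS} = \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^{2}
```

Each inner sum measures how tightly the points of one cluster sit around their mean, and the outer sum adds this up over all k clusters.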
The K-means clustering process consists of the five steps below:
 First, the number of clusters k is specified
 Then we select k initial cluster centers
 Assign each data point to the closest cluster center, as measured by Euclidean distance, cosine similarity, etc.
 Recalculate each cluster center as the mean of the data points belonging to that cluster
 Steps 3 and 4 are repeated until convergence is reached, that is, until the change in WCSS is very small.
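The steps above can be sketched in a few lines of base R. This is only a minimal illustration under some assumptions, not the implementation used by R's built-in kmeans(): the function name simple_kmeans, its arguments max_iter and tol, and the toy data are all hypothetical, initial centers are drawn at random from the observations, and Euclidean distance is used throughout.

```r
# Minimal k-means sketch (hypothetical helper, not R's built-in kmeans())
simple_kmeans <- function(x, k, max_iter = 100, tol = 1e-8) {
  x <- as.matrix(x)
  # Step 2: pick k distinct observations at random as the initial centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  wcss_old <- Inf
  for (iter in seq_len(max_iter)) {
    # Step 3: squared Euclidean distance from every point to every center,
    # then assign each point to its nearest center
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    cluster <- max.col(-d2, ties.method = "first")
    # Step 4: recompute each center as the mean of its assigned points
    for (j in seq_len(k)) {
      members <- cluster == j
      if (any(members)) centers[j, ] <- colMeans(x[members, , drop = FALSE])
    }
    # Step 5: stop once the WCSS barely changes between iterations
    wcss <- sum((x - centers[cluster, , drop = FALSE])^2)
    if (abs(wcss_old - wcss) < tol) break
    wcss_old <- wcss
  }
  list(cluster = cluster, centers = centers, wcss = wcss)
}

# Toy data: two made-up groups of 20 points each
set.seed(42)
pts <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
fit <- simple_kmeans(pts, k = 2)
print(table(fit$cluster))
print(fit$wcss)
```

Because the initial centers are random, different runs can end in different local optima, which is why the built-in function offers several random starts.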
Let’s use the Iris dataset, available from http://archive.ics.uci.edu/ml/datasets/Iris?ref=datanews.io, to walk through K-means clustering step by step in R.
Step 1: Package Installation
The very first step is to install the following packages in R using the commands below
 install.packages("cluster"): for clustering algorithms
 install.packages("factoextra"): for visualizing clusters using ggplot2
 install.packages("NbClust"): for finding the optimal number of clusters
After installation, load each package with library(), for example library(factoextra).

Step 2: Data Preparation
Next, let’s prepare the data for K-means clustering
 Load the data into R using the command:
data(iris)
 To view the first few rows and get a feel for the data, use the command:
head(iris)
The figure below shows the output. There are 5 columns: columns 1-4 are the attributes and the 5th column is the class label.
Figure 4: First 3 rows of the iris dataset
Step 3: Data Transformation
In this step we remove the Species column, which is the class label, and scale the data using the command below
t <- scale(iris[, -5])
We need to normalize the data because, as Figure 4 shows, columns such as Sepal.Length and Petal.Width have different ranges, which could distort the distance calculations used in the analysis that follows.
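To see what the scaling step does, the short check below compares the column ranges before scaling with the column means and standard deviations afterwards. It uses only base R; the variable names features and scaled are chosen for this example and are not part of the original walkthrough.

```r
data(iris)
features <- iris[, -5]      # drop the Species label (column 5)
scaled <- scale(features)   # center each column to mean 0 and scale to sd 1

# Before scaling: the four attributes cover noticeably different ranges
print(sapply(features, range))

# After scaling: every column has mean ~0 and standard deviation 1
print(round(colMeans(scaled), 10))
print(apply(scaled, 2, sd))
```

With all attributes on the same scale, no single column dominates the Euclidean distances that K-means relies on.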
Step 4: Run K-means clustering
Most of you are probably familiar with the iris dataset and know that its class label has 3 classes (setosa, versicolor, and virginica), so we can use k = 3 for K-means clustering, following the steps discussed above. Usually we don’t know the correct number of clusters, so we need to guess or use an estimation method, which I will discuss later in the article. The following shows how to run K-means clustering in R.
 The kmeans() function is used in R for K-means clustering. Its syntax is shown below
kmeans(x, centers, iter.max = 10, nstart = 1)
Here:
 x: a numeric matrix of data
 centers: the number of clusters or a set of initial cluster centers
 iter.max: the maximum number of iterations allowed
 nstart: if centers is a number, how many random sets of initial centers should be tried
We know that we have 3 clusters, so we set centers = 3, iter.max = 10, nstart = 15 and x = t, the scaled data from Step 3 above. Run the following command in R.
rm <- kmeans(t, 3, iter.max = 10, nstart = 15)
 Next, we can view the cluster assigned to each observation using the following command
rm$cluster
 Now we can visualize the K-means clusters using the fviz_cluster() function from factoextra. Its syntax is shown below
fviz_cluster(object, data, geom = c("point", "text"), stand = TRUE, ellipse.type = "convex")
Let’s set our values for the fviz_cluster() arguments
 object: rm
 data: t
 geom: specifies the geometry used for the graph. Allowed values are "point" and "text". We will use "point" to show only points
 stand: logical value; if TRUE, data is standardized before principal component analysis. We will use FALSE
 ellipse.type: specifies the frame type. Possible values include "convex", "confidence", "norm" and "euclid". We will use "euclid"
Run the following command in R.
fviz_cluster(rm, data = t, geom = "point", stand = FALSE, ellipse.type = "euclid")
The output of the above command is shown in Figure 5 below. We can see three clusters in different colors, with one class clearly separated from the others. The other two classes have overlapping attribute ranges, so they cannot be separated completely.
Figure 5: Clusters
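The separation described above can also be checked numerically by cross-tabulating the cluster assignments against the known species labels. The sketch below repeats the clustering with base R only; the seed value and the variable names scaled and fit are arbitrary choices for this example, and the cluster numbering may differ from run to run.

```r
data(iris)
scaled <- scale(iris[, -5])   # same preparation as in Step 3

set.seed(123)  # arbitrary seed, used only to make the random starts repeatable
fit <- kmeans(scaled, centers = 3, iter.max = 10, nstart = 15)

# Rows are clusters, columns are the true species labels
print(table(cluster = fit$cluster, species = iris$Species))

# Total within-cluster sum of squares for this solution
print(fit$tot.withinss)
```

A species whose observations fall almost entirely in one row of the table is well separated, while species that are split across rows overlap in the attribute space.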
I hope the walkthrough above has made K-means clustering on the Iris dataset clear. As noted earlier, we usually don’t know the correct number of clusters, so we need to guess or use an estimation method to determine the optimal number. The estimation method I will explain in this article is the elbow method, described below.
Elbow Method to determine the optimal number of clusters
Let’s first understand, step by step, how the elbow method works:
Step 1: Run the K-means clustering algorithm for different values of k
Step 2: For each value of k, calculate the total within-cluster sum of squares (wss)
Step 3: Plot the curve of wss against the number of clusters k
Step 4: The location of a bend (knee) in the plot indicates the appropriate number of clusters.
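These four steps can be carried out by hand with base R before turning to a helper package. The following is a rough sketch under a few assumptions: k is tried from 1 to 10, each run uses 15 random starts, and the seed and variable names (scaled, k_values, wss) are arbitrary choices for this example.

```r
data(iris)
scaled <- scale(iris[, -5])

set.seed(123)  # arbitrary seed for repeatable random starts
k_values <- 1:10

# Steps 1-2: run k-means for each k and record the total within-cluster sum of squares
wss <- sapply(k_values, function(k) {
  kmeans(scaled, centers = k, nstart = 15)$tot.withinss
})

# Step 3: plot wss against k; Step 4: look for the bend (knee) in the curve
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The curve drops steeply at first and then flattens; the value of k where the flattening begins is the candidate number of clusters.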
In R, we can use the factoextra package to implement the elbow method with the fviz_nbclust() function. Its syntax is shown below
fviz_nbclust(x, FUNcluster, method = c("silhouette", "wss"))
x: data frame or numeric matrix
FUNcluster: a partitioning function; here we use kmeans
method: the method used to determine the optimal number of clusters, such as "wss" or "silhouette"
The R code below applies the elbow method with kmeans() to the scaled iris data (t) prepared above
fviz_nbclust(t, kmeans, method = "wss") + geom_vline(xintercept = 3, linetype = 2)
Figure 6 below shows that the best number of clusters for the iris dataset is 3. As described above, the location of the bend (knee) in the plot indicates the appropriate number of clusters. This agrees with what we know about the iris dataset, which has three classes.
Figure 6: Optimal Number of Clusters
©ODSC2017