In this article, I will discuss what data mining is and why we need it. We will then look at a type of data mining called clustering and go over two clustering algorithms, K-means and hierarchical clustering, and how they solve data mining problems.
Table of Contents
- What is data mining and why do we need it?
- Five steps involved in data mining
- Data mining tasks
- What is clustering?
- Two types of clustering algorithms
- K-means clustering algorithm
- Elbow method to determine the optimal number of clusters
- Hierarchical clustering
What is data mining and why do we need it?
Data is collected and stored over networks at enormous speed, which makes it very difficult to find and understand the relevant parts. For example, credit card transactions, demographic data, and web server logs can occupy terabytes of storage. Data mining is a technique for analyzing and exploring large amounts of data to uncover meaningful insights. It helps in understanding, sorting, and selecting relevant information, uncovers hidden value in databases, and transforms data into useful information.
Five steps involved in the data mining process
Now, let’s delve into the five main steps involved in the data mining process. The steps are shown in Figure 1 and explained below.
Figure 1: Data mining steps
Step 1: Selection: In the selection step, the data is collected and integrated from a variety of sources. We collect only the data that will help us gain meaningful insight.
Step 2: Preprocessing: The collected data may not be clean, so it needs preprocessing. This step consists of removing missing values and inconsistent data, so we apply various techniques to remove such anomalies.
Step 3: Transformation: After preprocessing, we transform the data into forms appropriate for mining. This includes aggregation, normalization, etc.
Step 4: Data mining: Now we can apply data mining techniques to the data to uncover insights. These techniques include classification, clustering, and association.
Step 5: Evaluation: This step involves evaluating the generated patterns, for example by visualizing them and removing redundant ones.
Next, let’s look at the two main data mining tasks and see which category clustering falls into.
Data mining tasks
Figure 2: Data mining tasks
The two main data mining tasks are:
- Predictive methods: These methods use some variables to predict unknown values of other variables. They include data mining tasks such as classification.
- Descriptive methods: These methods find patterns that describe the data, yielding new and important information from the data available. They include clustering and association.
What is Clustering?
Clustering is a type of unsupervised learning, which means that it finds hidden structure in unlabeled data. It groups similar objects (close in terms of distance) together into a group called a cluster. The objects within a cluster are similar to one another and different from the objects in other clusters. Similarity is computed from the objects’ features, and Euclidean distance is a commonly used measure of how similar two objects are.
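To make the distance idea concrete, here is a minimal sketch in base R. The two points and the variable names (a, b, d_manual, d_builtin) are made up for illustration and are not taken from any dataset in this article.

```r
# Two objects described by two numeric features (made-up values)
a <- c(1, 2)
b <- c(4, 6)

# Euclidean distance computed directly from its definition
d_manual <- sqrt(sum((a - b)^2))

# The same distance via base R's dist(), whose default method is "euclidean"
d_builtin <- as.numeric(dist(rbind(a, b)))

print(d_manual)   # 5
print(d_builtin)  # 5
```

The smaller this distance, the more similar the two objects are considered to be.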
Now let’s look at Figure 3 below. There are three clusters: cluster 1, cluster 2, and cluster 3. The main point is that similar objects (close in terms of distance) are grouped into the same cluster, such that intra-cluster distances are minimized and inter-cluster distances are maximized.
Figure 3: Clustering
Two types of clustering algorithms
The two main clustering algorithms are:
- K-means clustering
- Hierarchical clustering
Let’s first understand how the K-means clustering algorithm works, and then I will walk through k-means clustering on the Iris dataset from http://archive.ics.uci.edu/ml/datasets/Iris?ref=datanews.io using the R language.
- K-means Clustering:
It uses the partitioning approach. The main goal of this algorithm is to partition n objects into k clusters and then evaluate the partition by some criterion, such as minimizing the within-cluster sum of squares (WCSS). Let x denote the observations and μ the mean of the observations in a cluster.
Then, we can calculate WCSS as shown in the formula below:
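In standard notation, and assuming that C_j denotes the set of observations assigned to cluster j and μ_j their mean (the symbol C_j and the index j are introduced here for illustration), the WCSS is usually written as:

```latex
\mathrm{WCSS} = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^{2},
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x
```

Each inner sum adds up the squared Euclidean distances between the points of one cluster and that cluster's mean, so a smaller WCSS means more compact clusters.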
The k-means clustering process consists of the five steps below:
- First, the number of clusters k is specified
- Then k initial cluster centers are selected
- Each data point is assigned to the closest cluster center, as measured by Euclidean distance, cosine similarity, etc.
- The cluster centers are recalculated as the mean of the data points belonging to each cluster
- Steps 3 and 4 are repeated until convergence, that is, until the WCSS changes very little between iterations
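The steps above can be sketched in a few lines of base R. This is only an illustrative sketch under stated assumptions, not the implementation behind R's kmeans(): the function name simple_kmeans, its arguments max_iter and tol, and the choice of random data rows as initial centers are all hypothetical, and only Euclidean distance is handled.

```r
# Illustrative k-means sketch (hypothetical helper, not R's built-in kmeans())
simple_kmeans <- function(x, k, max_iter = 100, tol = 1e-8) {
  x <- as.matrix(x)
  # Step 2: pick k distinct rows at random as the initial cluster centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  wcss_old <- Inf
  for (iter in seq_len(max_iter)) {
    # Step 3: squared Euclidean distance from every point to every center,
    # then assign each point to the closest center
    d2 <- vapply(seq_len(k),
                 function(j) rowSums(sweep(x, 2, centers[j, ])^2),
                 numeric(nrow(x)))
    cluster <- max.col(-d2, ties.method = "first")
    # Step 4: recompute each center as the mean of its assigned points
    for (j in seq_len(k)) {
      members <- cluster == j
      if (any(members)) centers[j, ] <- colMeans(x[members, , drop = FALSE])
    }
    # Step 5: stop once the WCSS barely changes between iterations
    wcss <- sum((x - centers[cluster, , drop = FALSE])^2)
    if (abs(wcss_old - wcss) < tol) break
    wcss_old <- wcss
  }
  list(cluster = cluster, centers = centers, wcss = wcss, iterations = iter)
}

# Usage on the scaled iris measurements (the seed value is arbitrary)
set.seed(42)
res <- simple_kmeans(scale(iris[, -5]), k = 3)
print(table(res$cluster))
print(res$wcss)
```

Because the starting centers are random, different seeds can give different clusterings, which is why the built-in function offers several random starts.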
Let me use the Iris dataset, which can be found at http://archive.ics.uci.edu/ml/datasets/Iris?ref=datanews.io. The steps below show K-means clustering in R.
Step 1: Package Installation
The very first step is to install the following packages in R using the following commands
- install.packages("cluster"): for clustering algorithms
- install.packages("factoextra"): for visualizing clusters using ggplot2
- install.packages("NbClust"): for finding the optimal number of clusters
Step 2: Data Preparation
Next, let’s prepare the data for k-means clustering
- Load the data into R using the command:
- To view the first few rows and better understand the data, use the command:
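A minimal sketch of these two steps, assuming the copy of the iris data that ships with R's datasets package rather than a file downloaded from the UCI repository:

```r
# Load the built-in iris data frame (assumption: the datasets package copy)
data("iris")

# Show the first 3 rows: four numeric attributes plus the Species label
print(head(iris, 3))

# Number of rows and columns
print(dim(iris))
```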
The figure below shows the output. There are 5 columns: columns 1-4 are the attributes and the 5th column is the class label.
Figure 4: First 3 rows of the iris dataset
Step 3: Data Transformation
In this step we remove the Species column, which is the class label, and scale the data using the following command
t <- scale(iris[, -5])
We need to normalize the data (scale() centers each column to mean 0 and rescales it to standard deviation 1) because, as Figure 4 above shows, columns such as Sepal.Length and Petal.Width have different ranges, and this may affect the distance-based analysis that follows.
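The effect of scaling can be checked directly. The sketch below uses only base R; the names raw and scaled are illustrative.

```r
# Numeric attributes only (drop the Species label in column 5)
raw <- iris[, -5]

# Before scaling: minimum and maximum of each column differ noticeably
print(sapply(raw, range))

# After scaling: every column has mean 0 and standard deviation 1
scaled <- scale(raw)
print(round(colMeans(scaled), 10))
print(apply(scaled, 2, sd))
```

With all columns on the same scale, no single attribute dominates the Euclidean distances that k-means relies on.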
Step 4: Run data mining k-means clustering
Most of you may be familiar with the iris dataset and know that its class label has 3 classes (Setosa, Versicolor, and Virginica), so we can use k = 3 for the k-means steps discussed above. Usually we don’t know the correct number of clusters, so we need to guess or use estimation methods, which I will discuss later in the article. The following shows how to run K-means clustering in R.
- The kmeans() function is used in R for K-means clustering. The syntax for the kmeans() function is shown below
kmeans(x, centers, iter.max = 10, nstart = 1)
- x: a numeric matrix of data
- centers: the number of clusters or a set of initial cluster centers
- iter.max: the maximum number of iterations allowed
- nstart: if centers is a number, how many random sets of initial centers should be tried
We know that we have 3 clusters, so we can set centers = 3, iter.max = 10, nstart = 15, and x = t from step 3 above. Run the following command in R.
rm <- kmeans(t, 3, iter.max = 10, nstart = 15)
- Next, we need to group each observation into its cluster using the following command
- Now we can visualize the k-means clusters using the fviz_cluster() function. The syntax for the fviz_cluster() function is shown below
fviz_cluster(object, data, geom = "point" or "text", stand = TRUE or FALSE, ellipse.type = "euclid" or "norm")
Let’s set our values in the fviz_cluster() function
- object: rm
- data: t
- geom: specifies the geometry used for the graph. Allowed values are point and text. We will use point to show only points
- stand: a logical value; if TRUE, data is standardized before principal component analysis. We will use FALSE
- ellipse.type: specifies the frame type. Possible values include convex, confidence, norm, and euclid. We will use euclid
Run the following command in R.
library(factoextra)
fviz_cluster(rm, data = t, geom = "point", stand = FALSE, ellipse.type = "euclid")
The output of the above command is shown in Figure 5 below. We can see three clusters in different colors, with one class clearly separated from the others. The other two classes have overlapping attribute ranges, so it is not possible to separate them completely.
Figure 5: Clusters
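The cluster assignments can also be inspected without plotting. Here is a minimal sketch in base R, assuming the built-in iris data; the names scaled and fit are illustrative and the seed value is arbitrary.

```r
# kmeans() starts from random centers, so fix a seed for repeatable output
set.seed(123)

scaled <- scale(iris[, -5])
fit <- kmeans(scaled, centers = 3, iter.max = 10, nstart = 15)

# Cluster number assigned to each of the first few observations
print(head(fit$cluster))

# Number of observations in each cluster
print(fit$size)

# Cross-tabulate clusters against the known species labels
print(table(cluster = fit$cluster, species = iris$Species))

# Total within-cluster sum of squares of the fitted solution
print(fit$tot.withinss)
```

The cross-tabulation makes the overlap between two of the species visible as counts rather than as overlapping regions in a plot.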
I hope the k-means clustering walkthrough on the Iris dataset was clear. As described earlier, we usually don’t know the correct number of clusters, so we need to guess or use an estimation method to determine the optimal number. The estimation method I will explain in this article is the elbow method, described below.
Elbow Method to determine the optimal number of clusters
Let’s first understand, step by step, how the elbow method works:
Step 1: Run the k-means clustering algorithm for different values of k
Step 2: For each value of k, calculate the total within-cluster sum of squares (wss)
Step 3: Plot the curve of wss against the number of clusters k
Step 4: The location of a bend (knee) in the plot indicates the appropriate number of clusters.
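These four steps can be sketched by hand in base R before turning to a helper package. The range 1 to 10 for k, the value nstart = 25, and the variable names are assumptions chosen for illustration.

```r
# Fix a seed because kmeans() uses random starting centers (value is arbitrary)
set.seed(123)
scaled <- scale(iris[, -5])

# Steps 1 and 2: run k-means for k = 1..10 and record the total
# within-cluster sum of squares for each k
k_values <- 1:10
wss <- sapply(k_values, function(k) {
  kmeans(scaled, centers = k, nstart = 25)$tot.withinss
})

# Step 3: plot wss against k; Step 4 is reading the bend off this curve
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The curve drops steeply for small k and then flattens; the point where it flattens is the elbow.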
In R, we can use the factoextra package to implement the elbow method with the fviz_nbclust() function. The syntax for the fviz_nbclust() function is shown below
fviz_nbclust(x, FUNcluster, method = c("silhouette", "wss"))
x: a data frame or numeric matrix
FUNcluster: a partitioning function; here we use kmeans
method: the method used to determine the optimal number of clusters, either wss or silhouette
The R code below applies the elbow method to kmeans() using the iris data prepared above
fviz_nbclust(t, kmeans, method = "wss") + geom_vline(xintercept = 3, linetype = 2)
The output in Figure 6 below suggests that the best number of clusters for the iris dataset is 3. As described above, the location of a bend (knee) in the plot indicates the appropriate number of clusters. This agrees with our knowledge of the iris dataset, which has three classes.
Figure 6: Optimal Number of Clusters