Beginner’s Guide to K-Nearest Neighbors in R: from Zero to Hero

In the world of Machine Learning, I find the K-Nearest Neighbors (KNN) classifier the most intuitive and the most accessible to beginners, even without introducing any math notation.

To decide the label of an observation, we look at its neighbors and assign the neighbors’ label to the observation of interest. Of course, looking at only one neighbor may introduce bias and inaccuracy, so the KNN method has a set of rules and procedures for determining the best number of neighbors, e.g., examining k > 1 neighbors and adopting a majority rule to decide the category.
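
As a quick illustration, here is a toy sketch using the built-in iris data and the knn() function from the class package (a hypothetical example, not the banking data analyzed later): with k = 1 the new observation simply copies the label of its closest neighbor, while with k = 5 it takes the majority vote among its five closest neighbors.

# toy sketch: label one new observation by the vote of its nearest neighbors
library(class)
set.seed(1)
train.idx <- sample(1:nrow(iris), 100)   # 100 flowers serve as the pool of potential neighbors
train.X <- iris[train.idx, 1:4]          # four numeric features
train.y <- iris$Species[train.idx]       # true labels
new.X <- iris[-train.idx, 1:4][1, ]      # one new observation to label
knn(train = train.X, test = new.X, cl = train.y, k = 1)  # label of the single closest neighbor
knn(train = train.X, test = new.X, cl = train.y, k = 5)  # majority vote among the five closest neighbors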


Measure of Distance

To choose the nearest neighbors, we first have to define what distance means. For continuous features, the Euclidean distance is the most common choice; for categorical data, there are the Hamming distance and the edit distance. More information can be found here, as I won’t delve into the math in this post.
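
For intuition, here is a minimal sketch in base R of two of these distances, using made-up toy vectors (the variable names are purely illustrative):

# Euclidean distance between two numeric observations
x1 <- c(2, 4, 6)
x2 <- c(3, 1, 7)
sqrt(sum((x1 - x2)^2))
# Hamming distance between two categorical observations:
# the number of positions where the categories differ
c1 <- c("red", "small", "round")
c2 <- c("red", "large", "round")
sum(c1 != c2)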

What is K-Fold Cross Validation?

In Machine Learning, Cross-Validation (CV) plays a crucial role in model selection and has a wide range of applications. In fact, CV has a rather straightforward design idea and also makes intuitive sense.

It can be briefly stated as follows (a small sketch in R follows the list):

  1. divide the data into K equally sized chunks/folds
  2. choose one fold as the test set and the remaining K-1 folds as the training set
  3. develop an ML model based on the training set
  4. apply the model to the test set and compare the predicted values with the true values
  5. repeat the process K times so that each fold serves as the test set exactly once
  6. average the metric scores for the model over the K folds
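
Here is a minimal sketch of these six steps in R, using the built-in iris data and KNN as a placeholder model (the full implementation for the banking data comes later in this post):

library(class)
K <- 5
set.seed(1)
folds <- sample(rep(1:K, length.out = nrow(iris)))          # step 1: assign each row to one of K folds
cv.errors <- sapply(1:K, function(i){
  train <- iris[folds != i, ]                               # step 2: K-1 folds form the training set ...
  test  <- iris[folds == i, ]                               # ... and the remaining fold is the test set
  pred  <- knn(train = train[, 1:4], test = test[, 1:4],
               cl = train$Species, k = 5)                   # step 3: fit the model on the training folds
  mean(pred != test$Species)                                # step 4: compare predicted vs. true values
})                                                          # step 5: sapply() repeats this for every fold
mean(cv.errors)                                             # step 6: average the error over the K folds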

How to choose K?

As you probably noticed, the tricky part of CV is how to set the value of K. Let’s say the total sample size is n. Technically, we can set K to any value from 2 up to n.

If K = n, we take n-1 observations as the training set and leave a single observation out as the test set, then repeat the process so that every observation is left out once. This is called “leave-one-out cross-validation” (LOOCV).

However, LOOCV requires a lot of computational power and may run for a very long time if your dataset is big. Taking a step back, there is no such thing as a universally best K, and a higher K is not necessarily a better K.

To choose the most appropriate number of folds, we have to make a tradeoff between bias and variance. If K is small, the estimate of the test error has high bias but low variance; if K is large, it has low bias but high variance.

Implementation in R

1. Software Preparation

# install.packages(c("ISLR", "ggplot2", "reshape2", "plyr", "dplyr", "class", "car"))

# Load libraries
library(ISLR)
library(ggplot2)
library(reshape2)
library(plyr)
library(dplyr)
library(class)
library(car) # recode() below uses car's "'old'=new" string syntax; load after dplyr so car::recode is found

# load data and clean the dataset
banking = read.csv("bank-additional-full.csv", sep = ";", header = T)

# check for missing data and make sure there is none
banking[!complete.cases(banking),]

# re-code qualitative (factor) variables into numeric
banking$job = recode(banking$job, "'admin.'=1;'blue-collar'=2;'entrepreneur'=3;'housemaid'=4;'management'=5;'retired'=6;'self-employed'=7;'services'=8;'student'=9;'technician'=10;'unemployed'=11;'unknown'=12")
banking$marital = recode(banking$marital, "'divorced'=1;'married'=2;'single'=3;'unknown'=4")
banking$education = recode(banking$education, "'basic.4y'=1;'basic.6y'=2;'basic.9y'=3;'high.school'=4;'illiterate'=5;'professional.course'=6;'university.degree'=7;'unknown'=8")
banking$default = recode(banking$default, "'no'=1;'yes'=2;'unknown'=3")
banking$housing = recode(banking$housing, "'no'=1;'yes'=2;'unknown'=3")
banking$loan = recode(banking$loan, "'no'=1;'yes'=2;'unknown'=3")
banking$contact = recode(banking$contact, "'cellular'=1;'telephone'=2")
banking$month = recode(banking$month, "'mar'=1;'apr'=2;'may'=3;'jun'=4;'jul'=5;'aug'=6;'sep'=7;'oct'=8;'nov'=9;'dec'=10")
banking$day_of_week = recode(banking$day_of_week, "'mon'=1;'tue'=2;'wed'=3;'thu'=4;'fri'=5")
banking$poutcome = recode(banking$poutcome, "'failure'=1;'nonexistent'=2;'success'=3")

# remove variable "pdays" because it has no variation
banking$pdays = NULL
# remove variable "duration" because it is collinear with the DV
banking$duration = NULL

After loading and cleaning the original dataset, it is a common practice to visually examine the distribution of our variables, checking for seasonality, patterns, outliers, etc.

#EDA of the DV
plot(banking$y,main="Plot 1: Distribution of Dependent Variable")

As can be seen, the outcome variable (banking service subscription) is not equally distributed, with many more “no”s than “yes”s.

This imbalance is inconvenient for supervised learning when we try to classify future labels correctly: many minority (“yes”) cases tend to be classified with the majority (“no”) label, so the error rate on the minority class is high.

In fact, the unbalanced distribution may favor a non-parametric classifier: my other post (Rare Event Classification Using 5 Classifiers) shows that KNN performs best among the ML methods compared, which may come down to the different mathematical and statistical assumptions behind parametric and non-parametric models.
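
For reference, a quick way to quantify the imbalance (a small sketch; it assumes the dependent variable is the column banking$y, as in this dataset):

# tabulate the dependent variable to see the class imbalance
table(banking$y)                # raw counts of "no" vs. "yes"
prop.table(table(banking$y))    # proportions of "no" vs. "yes"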

2. Data Split

As mentioned above, we need to split the dataset into a training set and a test set and adopt K-fold cross-validation to pick the best model. As a rule of thumb, we stick to the “80–20” division: we train ML models on 80% of the data and test them on the remaining 20%. For time-series data, the split is slightly different, typically 90% vs. 10%.

# split the dataset into training and test sets randomly,
# but set a seed so that the same split is generated each time we run the code
set.seed(1)
# create an index to split the data: 80% training and 20% test
index = round(nrow(banking)*0.2, digits = 0)
# sample randomly throughout the dataset and keep the total number equal to the value of index
test.indices = sample(1:nrow(banking), index)
# 80% training set
banking.train = banking[-test.indices,]
# 20% test set
banking.test = banking[test.indices,]
# select the training set except the DV
YTrain = banking.train$y
XTrain = banking.train %>% select(-y)
# select the test set except the DV
YTest = banking.test$y
XTest = banking.test %>% select(-y)

So far, we have finished data preparation and can move on to model selection.

3. Train Models

Let’s create a new function (“calc_error_rate”) to record the misclassification rate. The function calculates the proportion of cases in which the label predicted by the trained model does not match the actual outcome label, i.e., one minus the classification accuracy.

# define an error rate function and apply it to obtain test/training errors
calc_error_rate <- function(predicted.value, true.value){
  return(mean(true.value != predicted.value))
}

Then, we need another function, “do.chunk()”, to do K-fold cross-validation. For a given fold, the function returns a data frame with the training error and the validation error. The main purpose of this step is to select the best number of neighbors k for KNN.

nfold = 10
set.seed(1)
# cut() divides the row indices into nfold intervals; sample() shuffles the fold labels
folds = seq.int(nrow(banking.train)) %>%
  cut(breaks = nfold, labels = FALSE) %>%
  sample()

do.chunk <- function(chunkid, folddef, Xdat, Ydat, k){
  train = (folddef != chunkid)                       # training index
  Xtr = Xdat[train,]                                 # training set by the index
  Ytr = Ydat[train]                                  # true labels in the training set
  Xvl = Xdat[!train,]                                # validation set
  Yvl = Ydat[!train]                                 # true labels in the validation set
  predYtr = knn(train = Xtr, test = Xtr, cl = Ytr, k = k)  # predict training labels
  predYvl = knn(train = Xtr, test = Xvl, cl = Ytr, k = k)  # predict validation labels
  data.frame(fold = chunkid,                               # fold id
             train.error = calc_error_rate(predYtr, Ytr),  # training error per fold
             val.error = calc_error_rate(predYvl, Yvl))    # validation error per fold
}

# set error.folds to save validation errors
error.folds = NULL
# candidate numbers of neighbors: 1, 10, 20, 30, 40, 50
kvec = c(1, seq(10, 50, length.out = 5))

set.seed(1)
for (j in kvec){
  tmp = ldply(1:nfold, do.chunk,          # apply do.chunk() to each fold
              folddef = folds, Xdat = XTrain, Ydat = YTrain, k = j)
  tmp$neighbors = j                       # track each value of neighbors
  error.folds = rbind(error.folds, tmp)   # combine the results
}

# melt() in the package reshape2 reshapes wide-format data into long-format data
errors = melt(error.folds, id.vars = c("fold", "neighbors"), value.name = "error")

The upcoming step is to find the number of neighbors k that minimizes the validation error.

val.error.means = errors %>%
  # select all rows of validation errors
  filter(variable == "val.error") %>%
  # group the selected data by neighbors
  group_by(neighbors, variable) %>%
  # calculate the CV error for each k
  summarise_each(funs(mean), error) %>%
  # remove existing grouping
  ungroup() %>%
  filter(error == min(error))

# the best number of neighbors
numneighbor = max(val.error.means$neighbors)
numneighbor
## [1] 20

Therefore, the best number of neighbors is 20 after using 10-fold cross-validation.
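
As an optional check, we can plot the mean validation error against the number of neighbors to see how sharp the minimum is. This is a sketch built on the errors data frame created above (ggplot2 is already loaded):

errors %>%
  filter(variable == "val.error") %>%          # keep only the validation errors
  group_by(neighbors) %>%
  summarise(mean.error = mean(error)) %>%      # average over the 10 folds
  ggplot(aes(x = neighbors, y = mean.error)) +
  geom_line() + geom_point() +
  labs(title = "CV Error by Number of Neighbors",
       x = "Number of neighbors (k)", y = "Mean validation error")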

4. Some Model Metrics

# training error
set.seed(20)
pred.YTtrain = knn(train = XTrain, test = XTrain, cl = YTrain, k = 20)
knn_training_error <- calc_error_rate(predicted.value = pred.YTtrain, true.value = YTrain)
knn_training_error
## [1] 0.101214

The training error is 0.10.

# test error
set.seed(20)
pred.YTest = knn(train = XTrain, test = XTest, cl = YTrain, k = 20)
knn_test_error <- calc_error_rate(predicted.value = pred.YTest, true.value = YTest)
knn_test_error
## [1] 0.1100995

The test error is 0.11.

# confusion matrix
conf.matrix = table(predicted = pred.YTest, true = YTest)
conf.matrix

Based on the above confusion matrix, we can calculate the following values and prepare for plotting the ROC curve.

Accuracy = (TP +TN)/(TP+FP+FN+TN)

TPR/Recall/Sensitivity = TP/(TP+FN)

Precision = TP/(TP+FP)

Specificity = TN/(TN+FP)

FPR = 1 - Specificity = FP/(TN+FP)

F1 Score = 2*TP/(2*TP+FP+FN) = 2*Precision*Recall/(Precision+Recall)

# Test accuracy rate
sum(diag(conf.matrix)/sum(conf.matrix))
## [1] 0.8899005
# Test error rate
1 - sum(diag(conf.matrix)/sum(conf.matrix))
## [1] 0.1100995

As you may notice, test accuracy rate + test error rate = 1, and I’m providing multiple ways of calculating each value.
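
As a sketch of the remaining formulas, the other metrics can be read off the confusion matrix in the same way. This assumes “yes” is treated as the positive class and that conf.matrix has predictions in its rows and true labels in its columns, as constructed above:

TP <- conf.matrix["yes", "yes"]      # predicted yes, truly yes
FP <- conf.matrix["yes", "no"]       # predicted yes, truly no
FN <- conf.matrix["no", "yes"]       # predicted no, truly yes
TN <- conf.matrix["no", "no"]        # predicted no, truly no
recall      <- TP/(TP + FN)          # TPR / sensitivity
precision   <- TP/(TP + FP)
specificity <- TN/(TN + FP)
f1          <- 2*TP/(2*TP + FP + FN) # F1 score
c(recall = recall, precision = precision, specificity = specificity, F1 = f1)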

# ROC and AUC (prediction() and performance() come from the ROCR package)
library(ROCR)
knn_model = knn(train = XTrain, test = XTrain, cl = YTrain, k = 20, prob = TRUE)
prob <- attr(knn_model, "prob")                             # vote share of the winning class
prob <- 2 * ifelse(knn_model == "no", 1 - prob, prob) - 1   # convert to a score for the positive class ("yes")
pred_knn <- prediction(prob, YTrain)
performance_knn <- performance(pred_knn, "tpr", "fpr")
# AUC
auc_knn <- performance(pred_knn, "auc")@y.values
auc_knn
## [1] 0.8470583
plot(performance_knn, col = 2, lwd = 2, main = "ROC Curves for KNN")

In conclusion, we have learned what KNN is and built a pipeline for training a KNN model in R. More importantly, we have learned the underlying idea behind K-fold cross-validation and how to implement it in R.

Leihua Ye

Leihua is a Ph.D. Candidate in Political Science with a Master’s degree in Statistics at UC Santa Barbara. As a Data Scientist, Leihua has six years of research and professional experience in Quantitative UX Research, Machine Learning, Experimentation, and Causal Inference. His research interests include:

  1. Field Experiments, Research Design, Missing Data, Measurement Validity, Sampling, and Panel Data
  2. Quasi-Experimental Methods: Instrumental Variables, Regression Discontinuity Design, Interrupted Time-Series, Pre-and-Post-Test Design, Difference-in-Differences, and Synthetic Control
  3. Observational Methods: Matching, Propensity Score Stratification, and Regression Adjustment
  4. Causal Graphical Models, User Engagement, Optimization, and Data Visualization
  5. Python, R, and SQL

Connect here:
  1. http://www.linkedin.com/in/leihuaye
  2. https://twitter.com/leihua_ye
  3. https://medium.com/@leihua_ye
