With the holidays long gone, and likely everyone’s New Year’s resolutions along with them, I figured I would spend some time working with data instead of working with gym weights. I was lacking any real inspiration, so a friend pointed me to the University of California, Irvine’s Machine Learning data repository. Specifically, he sent me the link to the PAMAP2 Physical Activity Monitoring Data Set.
This dataset contains longitudinal data from 9 subjects. Each is hooked up to various biometric sensors that provide a reading each second. As each subject moves through her day, her activity, like driving or playing soccer, is also recorded. For a fictitious example, suppose Meghan was wearing these biometric sensors while driving her car to soccer practice. The sensors would provide second-by-second data, such as heart rate, during the drive. Next, she would start her workout, and again the sensors would provide data, presumably with an increased heart rate. Considering that automotive telematics and personal fitness trackers produce similar data, I was intrigued to explore modeling the data as a classification problem.
Since so many New Year’s resolutions center around working out, I figure it’s a fairly timely post. First, I will show you how to get the data and organize it. Then comes a little exploratory data analysis (EDA), followed by preprocessing and partitioning, and finally we apply the K-Nearest Neighbor (KNN) algorithm to the data. The goal is to use biometric data to classify what activity a person is doing.
This data is housed in a zip file on the UCI site. Considering the size, I opted to use data.table since it reads and manipulates data efficiently. I also use pbapply for applying functions with a progress bar. This helps me understand how long my code is going to take when working with millions of rows. Next, the plyr package is great for data manipulation and preprocessing. Lastly, although I could model a KNN algorithm in multiple packages, even caret, I chose klaR because it builds KNN algorithms fast.
After downloading the zip file, you will have to unpack the “.dat” files. Each file contains one subject’s data in a table ordered chronologically. There are two folders with subject data. Rather than specify files individually, you can programmatically scan the folders and then amass all the biometric data into a single data table. The list.files function will search for a pattern within a specified folder, so temp1 and temp2 become string vectors with the full file path of every file ending in “.dat”. I concatenated the two file objects into a single object, temp.
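temp1 <- list.files(path = '~/pmap/PAMAP2_Dataset/Protocol', pattern = "\\.dat$", full.names = TRUE)
temp2 <- list.files(path = '~/pmap/PAMAP2_Dataset/Optional', pattern = "\\.dat$", full.names = TRUE)
# Combine both vectors of file paths into the single object used below
temp <- c(temp1, temp2)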
fread will read any table into your R session. This is applied to each individual file path in temp. All of this is then unified using rbindlist, which row binds the list containing individual subject data.
activity.data <- rbindlist(pblapply(temp, fread), fill = TRUE)
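To see the scan-read-stack pattern in isolation, here is a minimal sketch that writes two tiny space-separated files to a temporary folder and reads them back. The folder name, file names and values are all made up, and plain lapply stands in for pblapply, which is called the same way but adds a progress bar.

```r
library(data.table)

# Hypothetical stand-in for the unzipped folder: two tiny .dat files
demo.dir <- file.path(tempdir(), "pamap_demo")
dir.create(demo.dir, showWarnings = FALSE)
writeLines(c("1.0 1 97", "2.0 1 98"), file.path(demo.dir, "subject101.dat"))
writeLines(c("1.0 4 110", "2.0 4 112"), file.path(demo.dir, "subject102.dat"))

# The pattern is a regular expression, so "\\.dat$" means "ends in .dat"
demo.files <- list.files(path = demo.dir, pattern = "\\.dat$", full.names = TRUE)

# Read each file into its own data table, then row bind the list
demo.data <- rbindlist(lapply(demo.files, fread), fill = TRUE)
print(demo.data)
```

The fill = TRUE argument pads with NA if one file has fewer columns than another, which is a cheap safeguard when stacking many files.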
Instead of imputing missing values, I decided to omit records containing any NA values. This is a large data set and really only a hobby post, so imputation did not seem worth the effort. If you want to impute, and therefore model on more records, I usually use the VIM package; its hotdeck function in particular has been helpful in the past. In this case I pass the base complete.cases function into the bracketed data table. The complete.cases function creates a T/F Boolean output, where TRUE represents a row without any NA values. The data table automatically retains only the rows flagged TRUE.
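As a minimal sketch of that filtering step, assuming a toy data table with made-up column names and values rather than the real sensor data:

```r
library(data.table)

# Toy data table with missing readings (values are made up)
toy <- data.table(timestamp  = c(1, 2, 3, 4),
                  heart_rate = c(98, NA, 101, 103),
                  hand_temp  = c(30.1, 30.2, NA, 30.4))

# TRUE marks a row that has no NA in any column
keep <- complete.cases(toy)
print(keep)

# A logical vector inside the brackets keeps only the TRUE rows
toy.complete <- toy[keep]
print(toy.complete)
```

On the real data the same idea would apply to activity.data in place of toy.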
The variable names are mostly incomprehensible, so I changed them. I simply paste the hand, chest and ankle measurement labels to a sequence of numbers coinciding with the data frame’s column numbers. Then I declare the colnames to be a character vector made up of the non-measurement inputs followed by these pasted names.
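The renaming can be sketched roughly as follows. The column layout is assumed from the data set’s readme (timestamp, activity ID, heart rate, then 17 columns for each of the three sensors), and every name other than Y_activityID is a hypothetical label chosen for illustration.

```r
# Assumed layout: 1 = timestamp, 2 = activity ID, 3 = heart rate,
# then 17 columns each for the hand, chest and ankle sensors
hand  <- paste0("hand_", 4:20)
chest <- paste0("chest_", 21:37)
ankle <- paste0("ankle_", 38:54)

# Non-measurement inputs first, then the pasted sensor names
new.names <- c("timestamp", "Y_activityID", "heart_rate", hand, chest, ankle)
length(new.names)

# Demonstrated on an empty 54-column data frame; on the real data this
# would be colnames(activity.data) <- new.names
demo <- as.data.frame(matrix(0, nrow = 2, ncol = 54))
colnames(demo) <- new.names
head(colnames(demo), 5)
```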
The dependent or Y feature is a multi-class factor corresponding to a person’s activity. Although the value is an integer, the data’s PDF defines the actual states. To re-map the target feature I first create y.code, a numeric vector with the existing activity codes. Then I create y.class, a string vector with the name of each activity. The mapvalues function accepts a vector of values to change, then “from” and “to” parameters. The code passes in the activity.data$Y_activityID vector followed by y.code and y.class, and rewrites the existing activity.data$Y_activityID. The second line changes the remapped values from characters to factors.
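activity.data$Y_activityID <- mapvalues(activity.data$Y_activityID, from = y.code, to = y.class)
activity.data$Y_activityID <- as.factor(activity.data$Y_activityID)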
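Here is a small sketch of the remapping on a toy vector. The three code-to-label pairs are an illustrative subset that I am assuming from the data set’s documentation, so check every code against the PDF before relying on them.

```r
library(plyr)

# Illustrative subset of activity codes (assumed; verify against the PDF)
y.code  <- c(1, 11, 20)
y.class <- c("lying", "car_driving", "playing_soccer")

# Toy vector of integer activity codes
activity.id <- c(1, 11, 20, 11, 1)

# Swap each code in `from` for the label in the same position of `to`
activity.label <- mapvalues(activity.id, from = y.code, to = y.class)

# Convert the character labels to a factor for modeling
activity.label <- as.factor(activity.label)
print(activity.label)
```

Codes that do not appear in `from` are left unchanged by mapvalues, so a mistyped code shows up as a stray number among the labels.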
The target feature is now a factor corresponding to the data dictionary. Check it out by calling sample on the vector with an integer, which prints that many randomly chosen values.
Although not the point of the post, it’s a good idea to perform EDA anytime you are modeling. At a minimum I like to tally the target feature. This will help you understand whether you have severely unbalanced targets, which affects how you construct a model matrix. Use table on Y_activityID to print the tally.
You can also make a quick visual by nesting the table data inside barplot. So the labels do not get cut off, specify margins in your graphics device. This is done with par, declaring mar with integer values that provide the cushion around the edges of the plot. Next take the previous code and nest it in barplot.
op <- par(mar=c(11,4,4,2))
The activity distribution from the 9 subjects.
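The tally and bar plot steps can be sketched on a toy target vector. The activity labels and counts here are made up, and the las = 2 argument is my own addition to turn the labels vertical, which is what makes the wide bottom margin useful.

```r
# Toy target vector standing in for activity.data$Y_activityID
activity <- factor(c("walking", "walking", "car_driving", "lying",
                     "walking", "lying", "playing_soccer"))

# Tally each class; prop.table turns the counts into shares
tally <- table(activity)
print(tally)
print(round(prop.table(tally), 2))

# Widen the bottom margin, draw the bars, then restore the old settings
op <- par(mar = c(11, 4, 4, 2))
barplot(tally, las = 2)
par(op)
```

Saving the old settings in op and restoring them afterwards keeps the wide margin from leaking into later plots.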
The basic EDA function summary can be applied to a data frame and will return information for each vector. To save time on this large data set, I took a random sample of the entire data table. It’s easy to sample a data table: use sample within the indexing brackets, passing in .N followed by the number of records to sample. Sampling 10,000 records this way creates eda.sample. Calling summary on the subset data now calculates the information faster.
A screenshot of the sampled activity data showing the summary information for some inputs.
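A minimal sketch of the sampling idiom, assuming a toy data table of 1,000 made-up rows and a sample of 100 instead of the 10,000 used on the real data:

```r
library(data.table)

set.seed(42)  # arbitrary seed so the sample is repeatable

# Toy data table standing in for activity.data
toy <- data.table(id = 1:1000,
                  heart_rate = rnorm(1000, mean = 100, sd = 15))

# .N is the row count, so sample(.N, 100) draws 100 distinct row numbers
toy.sample <- toy[sample(.N, 100)]

nrow(toy.sample)
summary(toy.sample$heart_rate)
```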
An Irresponsibly Quick KNN Explanation
The KNN algorithm is an analogous method. This means the predictions come from similar or analogous records. This is a common sense approach. Let’s say you have data shown below in a scatter plot with 2 classes Red and Green.
This visual represents your training set because the target, red or green, is known. Now you are presented a new record shown as a grey triangle in the graph below. Would you guess the unknown record is red or green?
If you look at the nearest neighbors to the triangle you may guess the new record is a red dot. This new record is analogous to the closest records.
A tuning parameter of KNN is the number of nearest neighbors. You have to specify the number of neighbors in case a new point is equally distant from both classes. For example, this graph shows a more centered grey triangle. If you are restricted to a single neighbor, you wouldn’t know which class to pick because the triangle sits exactly in between opposing markers. So instead, K = 3 in KNN would improve the results. For the sake of this illustration I added arrows to the closest 2 dots. 1 of the 3 neighbors is RED and the other 2 are GREEN, so the probability of being green is about 67% (2 out of 3).
Keep in mind that distance is measured as Euclidean meaning the straight line distance to the nearest known record. Remember your Pythagorean Theorem days in geometry? That’s the stuff of Euclidean distance. Also this data is complex and distance occurs in hyperspace not the 2 dimensions shown.
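The voting idea fits in a few lines of base R. This is a hand-rolled sketch on made-up points, not the klaR implementation used later in the post: it measures the Euclidean distance from a new record to each training record and lets the 3 closest vote.

```r
# Toy training set: two inputs and a known color (all values are made up)
train <- data.frame(x = c(1.0, 1.5, 2.5, 3.5, 4.0, 5.0),
                    y = c(1.0, 2.0, 2.5, 3.0, 4.0, 4.5),
                    color = c("red", "red", "red", "green", "green", "green"))

# The grey triangle: a new record with an unknown color
new.x <- 3
new.y <- 3

# Euclidean distance (Pythagorean Theorem) to every training record
dists <- sqrt((train$x - new.x)^2 + (train$y - new.y)^2)

# Keep the k closest records and let them vote
k <- 3
nearest <- as.character(train$color[order(dists)[1:k]])
votes <- table(nearest) / k
print(votes)

prediction <- names(which.max(votes))
print(prediction)
```

With these points the 3 nearest neighbors are 2 green and 1 red, so the new record is classified as green with a vote share of 2/3.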
Center & Scaling
The problem with measuring Euclidean distance is that inputs with different orders of magnitude will impact the KNN algorithm significantly. For example, if you were modeling customer outcomes with income measured in thousands and number of children (likely) in single digits, distances between incomes would seem larger than distances between numbers of children. To avoid this you have to scale and center your inputs.
To understand the impact of scaling and centering, apply it to the eda.sample data. The scale function can be applied to the data frame with its center and scale parameters set to TRUE.
Keep in mind that you do not want to scale the dependent variable, just the inputs.
Also, I don’t scale or even model on the timestamp feature. This removes the temporal aspect of the modeling, since from second to second a subject is likely doing the same activity. You could feature engineer an input that captures the longitudinal information in the timestamp, but I just omit timestamp and the target using the column index 3:54.
Now you can compare the summary output of eda.scale with that of eda.sample. Notice the mean for all vectors is now 0. Centering a vector subtracts the mean from each individual value. Scaling then divides the centered value by the vector’s standard deviation. Essentially this expresses each value as its distance from an average of 0, in standard deviations, and puts the values on the same scale so no single attribute dictates a larger Euclidean distance.
A portion of the eda.scale summary with mean at zero.
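To make the effect concrete, here is a small sketch using the income and children example from earlier. The customer values are made up, and base R’s dist function is used only to show the Euclidean distance before and after scaling.

```r
# Toy inputs on very different scales (values are made up)
customers <- data.frame(income   = c(40000, 52000, 61000, 75000, 120000),
                        children = c(0, 2, 1, 3, 2))

# Raw Euclidean distance between customers 1 and 2 is driven by income
dist(customers[1:2, ])

# Subtract each column's mean, then divide by its standard deviation
customers.scale <- scale(customers, center = TRUE, scale = TRUE)

# Column means are now 0 and standard deviations are now 1
round(colMeans(customers.scale), 10)
apply(customers.scale, 2, sd)

# After scaling, both inputs contribute on comparable terms
dist(customers.scale[1:2, ])
```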
Now that you understand the center and scaling function let’s apply it to the entire data set. The first input to scale is now
activity.data[,3:54, with=F]. The second line simply adds the dependent activity to the new scaled inputs.
To start let’s set a seed so you get the same results.
When modeling you should partition your data. This makes overfitting your data much harder and ensures your choices are a priori. The createDataPartition function from caret is a smart function that will partition rows. The inTrain object will match the target distribution, activity.scale$Y_activityID, and get 70% of the row numbers. Then this numeric object is used to index activity.scale, creating train.data and validation.data. In the end you have 70% of the rows in your training set and the remainder in the holdout.
inTrain <- createDataPartition(activity.scale$Y_activityID, p=0.7, list=FALSE)
train.data <- activity.scale[inTrain,]
validation.data <- activity.scale[-inTrain,]
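If you want to try the partitioning without the full data set, here is a sketch on the built-in iris data, where Species plays the role of Y_activityID. The seed value and the object names ending in .demo are my own choices.

```r
library(caret)

set.seed(1234)  # arbitrary seed so the split is repeatable

# Draw about 70% of the row numbers, stratified by the target classes
inTrain <- createDataPartition(iris$Species, p = 0.7, list = FALSE)

train.demo      <- iris[inTrain, ]
validation.demo <- iris[-inTrain, ]

# Row counts of each partition, then the class shares in the training set
nrow(train.demo)
nrow(validation.demo)
prop.table(table(train.demo$Species))
```

Because the sampling is done within each class, the class shares in the training set stay close to those of the full data.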
Using the simple KNN from klaR will speed up the model build considerably, but as with many data science problems there is an accuracy penalty for taking a shortcut. Simple KNN looks at kernel densities and makes an assumption about the target’s classes. I am OK with the accuracy versus speed tradeoff, considering that the numerous inputs of high-dimensional data like this mean a lot of distance measures during training.
In sknn, pass in the name of the Y variable and then a tilde followed by a period. This is the formula method instructing sknn to target Y_activityID and use all other columns as inputs. Then specify the train.data and a gamma greater than 0. Gamma is a tuning parameter that declares a Gaussian-like density is used to weight the classes of the k nearest neighbors.
x <- sknn(Y_activityID ~ ., data = train.data, gamma=0.5)
To save time I convert the
validation.data to a data table and then sample it to 100 records. Without sampling the predictions take a long time due to the numerous inputs. In a professional setting it is worth the wait but my New Year’s Resolution is working with data not staring at my R console waiting for processes to finish ☺
Now apply the generic predict function to the validation.sample data. The first parameter is the fit model x, followed by the data. Remember this is data the algorithm has never encountered and should be a true measure of the model’s predictive power. After a few agonizing minutes the preds object will be created.
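The fit and predict steps can be tried quickly on the built-in iris data. This is only a sketch under my own assumptions: the 30-row holdout, the seed and the object names ending in .demo are illustrative, and iris is far smaller than the activity data, so it finishes almost instantly.

```r
library(klaR)

set.seed(1234)  # arbitrary seed so the holdout is repeatable

# Hold out 30 rows of iris; the rest is the training set
holdout    <- sample(nrow(iris), 30)
train.demo <- iris[-holdout, ]
valid.demo <- iris[holdout, ]

# "Species ~ ." means: predict Species from every other column
fit <- sknn(Species ~ ., data = train.demo, gamma = 0.5)

# predict returns a list; $class holds the predicted labels
preds.demo <- predict(fit, valid.demo)
head(preds.demo$class)
```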
Using confusionMatrix from caret, pass in the preds$class vector and the actual outcomes in validation.sample$Y_activityID. Reviewing the matrix illustrates the number of correct classifications. To calculate the overall accuracy, sum the diagonal values and divide by the sum of all values in the confusion matrix. Overall the accuracy is 94%!
A small portion of the confusion matrix showing car_driving was classified correctly.
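The accuracy arithmetic is easy to check by hand. The sketch below uses base R’s table instead of caret’s confusionMatrix, with made-up predictions and outcomes for 10 records, so the numbers say nothing about the real model.

```r
# Toy predictions and actual outcomes (made up for illustration)
lv <- c("car_driving", "lying", "walking")
actual    <- factor(c("car_driving", "car_driving", "lying", "lying", "lying",
                      "walking", "walking", "walking", "walking", "walking"),
                    levels = lv)
predicted <- factor(c("car_driving", "car_driving", "lying", "lying", "walking",
                      "walking", "walking", "walking", "walking", "lying"),
                    levels = lv)

# Rows are predictions, columns are actual outcomes
conf.mat <- table(Predicted = predicted, Actual = actual)
print(conf.mat)

# Correct classifications sit on the diagonal
accuracy <- sum(diag(conf.mat)) / sum(conf.mat)
print(accuracy)
```

Here 8 of the 10 records fall on the diagonal, so the accuracy is 0.8.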
I hope you liked this post. It was fun to tackle a problem and not be all that vested in the outcome so I could take some shortcuts with sampling and learn about simple KNN (sknn). The data is robust and it is likely that other classification approaches could improve results. In the end, writing this post helps me keep my New Year’s Resolution to work with data every day in R.
Working at Liberty Mutual I shape the organization's strategy and vision concerning next generation vehicles. This includes today's advanced vehicle ADAS features and the self driving cars of the (near) future. I get to work with exciting startups, MIT labs, government officials, automotive leaders and various data scientists to understand how Liberty can thrive in this rapidly changing environment. Plus I get to internally incubate ideas and foster an entrepreneurial ethos! Specialties: Data Science, Text Mining, IT service management, Process improvement and project management, business analytics