How To Build A Spam Classifier Using Decision Tree How To Build A Spam Classifier Using Decision Tree
In the realm of Supervised Learning, there are tons of classifiers, including Logistic Regressions (logit 101 and logit 102), LDA, Naive Bayes, SVM, KNN,... How To Build A Spam Classifier Using Decision Tree

In the realm of Supervised Learning, there are tons of classifiers, including Logistic Regressions (logit 101 and logit 102), LDA, Naive Bayes, SVM, KNN, Random Forest, Neural Networks, and so many more coming each day!

[Related Article: The Complete Guide to Decision Trees (part 1)]

The real question that all data scientists should be asking themselves is:


Are we choosing the fancy model?

As in the following Tweet,

what’s the point of building a 10 layer deep learning model when a simple regression is enough?

Let’s face it. We, data scientists, sometimes could be self-centered narcissist who cares more about showing off skills who do not listen to our clients’ needs.

Both internal-facing DS roles (e.g. People Analytics) and external-facing DS roles (e.g. Statistical Consultant), they need to provide quick solutions that non-tech colleagues and clients can understand and apply right away. Please don’t confuse them with DS terms, jargon, coefficient interpretations, or any other unnecessary hassles. Absolutely no value added if the ML models are too difficult to be useful.

With this in mind, we learn how to build a simple spam classifier using an interpretable ML classifier, Decision Tree, in this post (the UCI Machine Learning database hosts the dataset and can be accessed here).

AndyPandy from Pixabay

Decision Tree

Decision Tree is a supervised learning method that segments space of outcomes into J numbers of regions R(1), R(2), …, R(J) and predicts the response for each region R.

Using a recursive binary splitting, we construct a DT model in four simple steps:

  1. Split a region, R(j), based on a variable, X(i)
  2. Set a cutoff point, s, and split R(j) into two regions, if
  • {X|X(i)<s} = R(category_1)
  • {X|X(i)>s} = R(category_2)

3. Repeat the first two steps for the next region

4. Continue until we have used up all available units or there is a small number of units left in each leaf node.

In everyday layman’s terms, DT finds the best way of dividing the space and makes the space “purer” after each split. There are three ways of measuring how pure or impure the space is:

  1. Classification Error Rate

2. Gini Index

3. Entropy

How to choose best split?

  1. For region j, calculate the prior impurity I(before the split) and the post-split impurity for variable I(after the split).
  2. Choose the variable v that leads to the biggest reduction between I(before the split) and I(after the split).

Ideally, the would-be perfect ML classifier is to keep splitting until each unit has its own leaf, aka. J = N. However, this leads to over-fitting, which makes it less suitable for other datasets. In simply language, over-fitting means we fit our ML model too tightly with the dataset that we are using and less practical if we want to generalize to other cases.

To address this issue, we need to prune the model and place penalty each time the algorithm wants to take another split.

Pruning reduces the total misclassification error while keeping a smaller tree. It can be presented as follows:

cost = total misclassification error + αJ

  • α: tuning parameter
  • αJ: the penalty term

As a side note, this is a very common form of constructing a loss function as you may see in other scenarios.

R Implementations

1. R package, library, and load data

library(maptree)spam <- read_table2("spambase.tab", guess_max=2000)
spam <- spam %>%
  mutate(y = factor(y, levels=c(0,1), labels=c("good","spam"))) %>%
  mutate_at(.vars=vars(-y), .funs=scale)colnames(spam)

2. Data split: train and test

#set.seed() for version control
set.seed(1)#sample the dataset
test.indices = sample(1:nrow(spam), 1000) #create train and test sets
YTrain = spam.train$y
XTrain = spam.train %>% select(-y)
YTest = spam.test$y
XTest = spam.test %>% select(-y)

Next, we use 10-fold Cross-Validation. For a 6-step summary of CV, please refer to my other post (KNN).

nfold = 10
folds = seq.int(nrow(spam.train)) %>% # sequential observations IDs
  cut(breaks = nfold, labels=FALSE) %>% # sequential fold IDs

Let’s create a function, calc_error_rate, to calculate the classification errors.

calc_error_rate <- function(predicted.value, true.value){

Here comes the juicy part: build a simple DT model.

# the number = the row numbers of the spam.train 
nobs = nrow(spam.train)# a DT model
# please check the official R documents for the parameters
spamtree = tree(y~., data= spam.train,
     na.action = na.pass, 
     control = tree.control(nobs, mincut =2, minsize = 5, mindev = 1e-5))
summary(spamtree)# plot the original unpruned DT model
draw.tree(prune, nodeinfo=TRUE)
Figure 1

Normally, we do not plot untrimmed DT models because there are too many leafs to be interpretable. This is a bad example! I’m doing it for pedagogical reasons.

3. Pruning

To make the plot prettier, simpler, and interpretable, let’s prune the DT model with only 8 leafs left.

prune =prune.tree(spamtree, best=8)
summary(prune) # plot the pruned model
draw.tree(prune, nodeinfo=TRUE)
Figure 2

Compared to the previous crammed plot, Figure 2 is much better in every sense. We know what the most important variables are. We know how many observations are within each category.

In addition, we can use cross-validation to find the best number of pruning. Fortunately, the tree package includes a default CV function, cv.tree, to minimizes the misclassification rate.

cv = cv.tree(spamtree,FUN=prune.misclass, K=10)
Figure 3

The best number of splitting is 37. We plot the misclassification against the tree sizes. Setting the number of leaf = 37, we build a new tree model called spamtree.pruned.

spamtree.pruned<-prune.misclass(spamtree, best=37)
# training and test errors of spamtree.pruned
# set type = "class" because we are predicting the categorypred.train = predict(spamtree.pruned, spam.train, type=”class”)
pred.test = predict(spamtree.pruned, spam.test, type=”class”)# training error
DT_training_error <- calc_error_rate(predicted.value=pred.train, true.value=YTrain)
DT_training_error[1] 0.05165232# test error
DT_test_error <- calc_error_rate(predicted.value=pred.test, true.value=YTest)
DT_test_error[1] 0.072

That’s all! Folks, we have learned a super simple but useful ML classifier with 3 easy steps.

[Related Article: The Complete Guide to Decision Trees (part 2)]

In this post, we have learned what DT is, its merits, and R implementations. Here are some key takeaways:

  • The Industry wants quick simple solutions.
  • Decision Tree is quick, simple, and useful.
  • DT is visual and interpretable.

Originally Posted Here

Leihua Ye

Leihua is a Ph.D. Candidate in Political Science with a Master's degree in Statistics at the UC, Santa Barbara. As a Data Scientist, Leihua has six years of research and professional experience in Quantitative UX Research, Machine Learning, Experimentation, and Causal Inference. His research interests include: 1. Field Experiments, Research Design, Missing Data, Measurement Validity, Sampling, and Panel Data 2. Quasi-Experimental Methods: Instrumental Variables, Regression Discontinuity Design, Interrupted Time-Series, Pre-and-Post-Test Design, Difference-in-Differences, and Synthetic Control 3. Observational Methods: Matching, Propensity Score Stratification, and Regression Adjustment 4. Causal Graphical Model, User Engagement, Optimization, and Data Visualization 5. Python, R, and SQL Connect here: 1. http://www.linkedin.com/in/leihuaye 2. https://twitter.com/leihua_ye 3. https://medium.com/@leihua_ye