In this post we will describe how to evaluate a predictive model.

Why bother creating complex predictive models if 5% of the customers will churn anyway? Because a predictive model ranks our clients by the probability that they will abandon the company. It helps answer these two questions:

1. How should we optimise our resources?
2. What targeted actions can we take before they leave?

Calling every customer to prevent that 5% from leaving is very different from calling a fifth of them, or fewer, and getting the same result. We build predictive models to optimise these business processes.

Basically, you train a predictive algorithm on a set of records, each described by a set of variables, called the training dataset. The idea is to predict the value of a target variable. There are two types of predictive problems: classification and regression. Classification is when the target variable takes a limited number of discrete values. Regression is when the target variable is continuous.

How do we know that a model is good enough? We evaluate it. We compare the predicted results with the actual results, calculate evaluation measures from these values, and then decide which model is the best.

In classification problems we count how many predictions are right and how many are wrong. In regression problems we calculate how close the predicted value is to the actual value.
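As a minimal sketch of both comparisons, assume the actual and predicted values are available as plain Python lists. The function names and sample values are illustrative only, and mean absolute error is just one common way to measure "how close" a regression prediction is:

```python
def count_right_wrong(actual, predicted):
    """Classification: count matching and non-matching predictions."""
    right = sum(1 for a, p in zip(actual, predicted) if a == p)
    return right, len(actual) - right


def mean_absolute_error(actual, predicted):
    """Regression: average distance between predicted and actual values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up values, for illustration only.
labels_actual = ["yes", "no", "yes", "yes"]
labels_predicted = ["yes", "yes", "yes", "no"]
print(count_right_wrong(labels_actual, labels_predicted))  # (2, 2)

values_actual = [10.0, 20.0, 30.0]
values_predicted = [12.0, 18.0, 33.0]
print(mean_absolute_error(values_actual, values_predicted))
```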

The algorithms use a training set to create a model that, hopefully, generalises well enough to predict the target variable. We use a test set, a dataset different from the training set, to check whether the model predicts well. We want our models to generalise well, so that they predict future cases and not just the training cases.
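One simple way to obtain a separate test set is to shuffle the records and hold some of them out. The sketch below uses only the Python standard library; the helper name, the 30% test fraction and the seed are assumptions chosen for illustration:

```python
import random


def train_test_split(records, test_fraction=0.3, seed=42):
    """Shuffle a copy of the records and split it into train and test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


customers = list(range(100))  # stand-in for 100 customer records
train, test = train_test_split(customers)
print(len(train), len(test))  # 70 30
```

The model is then trained only on `train` and evaluated only on `test`, so no record is used for both.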

There are several measures of error, and the idea of this post is to provide enough background to know how to use them and how to combine them.

There is a risk that the algorithm learns the training set too well and then performs badly on the test set. We know this problem as over-fitting.

When the model performs badly on both the training and test sets, we call this problem under-fitting. It usually happens because the model is too simple to capture the patterns in the data.
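A rough way to tell the two problems apart is to compare the score on the training set with the score on the test set. The helper below is only a sketch: the function name and both thresholds are arbitrary assumptions, and sensible values depend on the problem:

```python
def diagnose_fit(train_score, test_score, good_enough=0.8, max_gap=0.1):
    """Label a model from its train/test scores (higher is better).

    good_enough and max_gap are arbitrary illustrative thresholds.
    """
    if train_score < good_enough and test_score < good_enough:
        return "under-fitting"  # bad on both sets
    if train_score - test_score > max_gap:
        return "over-fitting"   # much better on training than on test
    return "ok"


print(diagnose_fit(0.99, 0.70))  # over-fitting
print(diagnose_fit(0.60, 0.58))  # under-fitting
print(diagnose_fit(0.90, 0.88))  # ok
```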

Classification Measures

Accuracy is the most popular evaluation measure, but on its own it is not always the best one. Imagine we want to predict a binary variable taking ‘yes’ and ‘no’ values. Suppose there are many yes’s and only a few no’s to start with. In this case, a constant model that always predicts ‘yes’ will yield high accuracy. It will only be wrong on the few no’s, which is surely not what you expect from a model. Accuracy is a misleading measure when the target variable is not balanced.
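The following sketch illustrates the problem with made-up numbers (95 ‘yes’ and 5 ‘no’ cases are an assumption, not data from a real project). A model that answers ‘yes’ for everyone scores 95% accuracy while never detecting a single ‘no’:

```python
# Made-up imbalanced target: 95 'yes' and 5 'no'.
actual = ["yes"] * 95 + ["no"] * 5

# A constant "model" that predicts 'yes' for every record.
predicted = ["yes"] * len(actual)

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print(accuracy)  # 0.95

# Number of 'no' cases the model actually identifies.
no_caught = sum(1 for a, p in zip(actual, predicted) if a == "no" and p == "no")
print(no_caught)  # 0
```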

When we try to predict a discrete variable we can be right or wrong; there is no other state. When we are right, we call the result a True Positive (TP) or a True Negative (TN). When we are wrong, we call it a False Positive (FP) or a False Negative (FN).

One of the simplest ways to visualise the result of a model with TP, TN, FP, and FN is a confusion matrix. It shows the predicted results on one axis and the actual results on the other. We expect a strong diagonal in the matrix, meaning that the predicted results agree with the actual results.
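As a minimal sketch, the four cells of a binary confusion matrix can be counted directly from two lists of labels. The function name, the choice of ‘yes’ as the positive class and the sample labels are assumptions made for illustration:

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Count TP, TN, FP and FN for a binary target."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts


# Made-up labels, for illustration only.
actual = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "no", "yes", "yes", "no"]
cm = confusion_counts(actual, predicted)

# Rows are actual values, columns are predicted values.
print("            pred yes  pred no")
print(f"actual yes  {cm['TP']:8d}  {cm['FN']:7d}")
print(f"actual no   {cm['FP']:8d}  {cm['TN']:7d}")
```

The diagonal of the printed table (TP and TN) holds the correct predictions.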

The following are some other measures:

  • False alarm ratio = False Positive rate = FP / (FP + TN) is the percentage of negatives misclassified as positives.
  • Miss rate = False Negative rate = FN / (TP + FN) is the percentage of positives misclassified as negatives.
  • Recall = True Positive rate = TP / (TP + FN) is the percentage of positives classified correctly, which is the same as (1 – Miss rate).
  • Precision = TP / (TP + FP) is the percentage of real positives out of all the cases classified as positive by the model.
  • Specificity = True Negative rate = TN / (TN + FP) is the percentage of negatives classified correctly, which is the same as (1 – False alarm ratio).
  • Accuracy = (TP + TN) / (TP + TN + FP + FN) is the percentage of all cases classified correctly.
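The formulas in this list translate directly into code. The sketch below assumes the four counts are already known. The function name and the example counts are made up, and there is no guard against empty denominators:

```python
def classification_measures(tp, tn, fp, fn):
    """Compute the measures listed above from the four confusion-matrix counts.

    No guard against division by zero: every denominator is assumed non-empty.
    """
    return {
        "false_alarm_ratio": fp / (fp + tn),
        "miss_rate": fn / (tp + fn),
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Made-up counts, for illustration only.
measures = classification_measures(tp=40, tn=900, fp=50, fn=10)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```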

The important message here is that there is no sense in using only one measure to evaluate the result of a model. If you have 100% recall for model A, or a 0% false alarm ratio for model B, it doesn’t mean those models perform well. You need to use the metrics in pairs, for example, Recall and Precision, or Miss rate and False alarm ratio, or True Positive rate and False Positive rate.
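To see why a single measure can mislead, consider two trivial models on a made-up dataset with 5 positives and 95 negatives. Model A flags everyone as positive and model B flags no one. The helper name and the data are assumptions for illustration:

```python
def rates(actual, predicted):
    """Return recall and false alarm ratio for boolean labels."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)
    return {"recall": tp / (tp + fn), "false_alarm": fp / (fp + tn)}


actual = [True] * 5 + [False] * 95
model_a = [True] * 100   # flags every case as positive
model_b = [False] * 100  # flags no case as positive

print(rates(actual, model_a))  # {'recall': 1.0, 'false_alarm': 1.0}
print(rates(actual, model_b))  # {'recall': 0.0, 'false_alarm': 0.0}
```

Model A has perfect recall and model B has a perfect false alarm ratio, yet neither is useful. Looking at both measures together exposes that.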

The Lift Chart

The lift chart is another great way to present the performance of a model. It is one of the simplest ways to communicate how a predictive model performs, even to non-expert users.

The x-axis shows the percentage of the dataset selected, usually divided into deciles. The y-axis shows how many times better the model does than a random selection. Suppose 5% of the customers churn every year. If we select any percentage of the customers at random, we expect about 5% of them to be churners. The baseline, equal to one, represents this average of the target variable: no matter how many records we pick at random, we get about the same percentage of the target variable among them. So we build predictive models to find that 5% within a reduced number of customers. The model ranks the customers by their likelihood to churn, so we can act before they make the decision.

Here’s an example of how to read the chart: “the model is performing 5 (five) times better than a random selection in the first 10% of the database.” This means that, instead of only 5% of the customers in the first decile being churners, 25% of the customers in that decile are potential churners.
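The lift per decile can be sketched as follows: sort the customers by the model score, cut the ranking into ten equal groups, and divide the churn rate of each group by the overall churn rate. The function name and the synthetic data (1,000 customers, 50 churners, 25 of them in the top decile) are assumptions built to reproduce the lift of 5 described above:

```python
def lift_by_decile(scores, churned):
    """Return the lift of each decile, from highest to lowest score.

    Assumes the number of records is a multiple of 10.
    """
    ranked = [c for _, c in sorted(zip(scores, churned),
                                   key=lambda pair: pair[0], reverse=True)]
    base_rate = sum(ranked) / len(ranked)  # overall churn rate (the baseline)
    size = len(ranked) // 10
    lifts = []
    for d in range(10):
        bucket = ranked[d * size:(d + 1) * size]
        lifts.append((sum(bucket) / len(bucket)) / base_rate)
    return lifts


# Synthetic data: customer 0 has the highest score, customer 999 the lowest.
n = 1000
scores = [n - i for i in range(n)]
churned = [0] * n
for i in range(0, 100, 4):      # 25 churners inside the top decile
    churned[i] = 1
for i in range(100, 1000, 36):  # 25 churners spread over the other deciles
    churned[i] = 1

lifts = lift_by_decile(scores, churned)
print([round(x, 2) for x in lifts])  # the first value is 5.0
```

Because the deciles have equal size, the lifts average out to one, which is the baseline of a random selection.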

The baseline shows the performance without a model. If you pick customers at random, you expect about 5% of churners in any number of records, so the lift is always equal to one and you can’t beat the 5%.

Predictive models help beat random selection. We want to reach out to that 5% of churners in order to keep them, but without having to contact all the customers in the database.

There are more ways to evaluate models, but the aim of the post is just to show the basic ones, so remember:

  • Use the confusion matrix to analyse the results.
  • Never use one metric alone; use at least two different metrics to evaluate a model.
  • Use the Lift Chart to present results to business users.



Diego Arenas, ODSC

I've worked in BI, DWH, and Data Mining. MSc in Data Science. Experience in multiple BI and Data Science tools, always thinking about how to solve information needs and add value to organisations from the data available. Experience with Business Objects, Pentaho, Informatica Power Center, SSAS, SSIS, SSRS, MS SQL Server from 2000 to 2017, and other DBMS, Tableau, Hadoop, Python, R, SQL. Predictive modelling. My interests are in Information Systems, Data Modeling, Predictive and Descriptive Analysis, Machine Learning, Data Visualization, Open Data. Specialties: Data modeling, data warehousing, data mining, performance management, business intelligence.