Machine learning models are powerful prediction tools, and in many business applications that predictive power is valuable. Choosing a model for any application means finding the one that performs best across a given set of criteria, and one of those criteria is interpretability. Neural nets, decision trees, and regression models all offer different levels of interpretability, so many business applications must strike a balance between predictive accuracy and interpretability. In this article, we will discuss how Cubist models (in R) provide effective interpretability while still delivering strong predictive performance.
Cubist models were developed by J. R. Quinlan and build on the M5 model tree he introduced in the paper Learning with Continuous Classes (1992). The best description of the Cubist model comes from Quinlan's website, rulequest.com, and is as follows:
“Cubist is a powerful tool for generating rule-based models that balance the need for accurate prediction against the requirements of intelligibility. Cubist models generally give better results than those produced by simple techniques such as multivariate linear regression, while also being easier to understand than neural networks.”
This description makes clear the balance that cubist models offer between interpretability and predictive power.
Cubist models are a form of decision tree modeling that uses rules to subset the data. The primary algorithm has two steps. The first step establishes a set of rules that divides the data into smaller subsets. The second step fits a regression model to each of these subsets to arrive at a prediction.
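The two steps can be illustrated with a hand-rolled sketch in base R. This is not Cubist's actual algorithm, which searches for the rules itself. Here the single split on `wt` in the built-in `mtcars` data is an arbitrary assumption chosen for illustration, and `predict_piecewise` is a hypothetical helper.

```r
# Step 1: a hand-picked rule splits the data into two subsets.
# (Assumption: the threshold is chosen by eye; Cubist would learn its rules.)
threshold <- 3.2
light <- mtcars[mtcars$wt <= threshold, ]
heavy <- mtcars[mtcars$wt >  threshold, ]

# Step 2: fit a separate linear regression inside each subset.
fit_light <- lm(mpg ~ wt + hp, data = light)
fit_heavy <- lm(mpg ~ wt + hp, data = heavy)

# Hypothetical helper: route each row to the model whose rule it satisfies.
predict_piecewise <- function(newdata) {
  is_light <- newdata$wt <= threshold
  out <- numeric(nrow(newdata))
  out[is_light]  <- predict(fit_light, newdata[is_light, , drop = FALSE])
  out[!is_light] <- predict(fit_heavy, newdata[!is_light, , drop = FALSE])
  out
}

preds <- predict_piecewise(mtcars)
mean(abs(mtcars$mpg - preds))  # mean absolute error on the training data
```

Each subset gets its own coefficients, which is what lets a rule-based model follow relationships that a single global regression would smooth over.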
Where an ordinary regression tree predicts the average of the cases in a leaf node, Cubist uses a regression model at the leaf instead. These predictions can be further adjusted through the neighbor and committee options of the model. The neighbor option applies a nearest-neighbor algorithm and then combines the Cubist prediction with the nearest-neighbor prediction to arrive at a final output. The committee option has a similar benefit to boosting: the first Cubist model makes a prediction, and subsequent Cubist models attempt to adjust for the errors made by the prior models.
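In the Cubist R package, these two options are set in different places: `committees` is an argument to `cubist()` when the model is fit, while `neighbors` is an argument to `predict()`. The sketch below shows both, using the built-in `mtcars` data as a stand-in; the chosen columns and the values 5 and 3 are assumptions for illustration, not tuned settings.

```r
library(Cubist)

# Built-in mtcars data stands in for a real problem (illustration only).
x <- mtcars[, c("wt", "hp", "disp", "cyl")]
y <- mtcars$mpg

# `committees` is set when the model is fit: 5 boosting-like iterations.
fit <- cubist(x = x, y = y, committees = 5)

# `neighbors` is set at prediction time: adjust each rule-based prediction
# using the 3 most similar training cases (0 turns the adjustment off).
pred_rules     <- predict(fit, newdata = x)
pred_neighbors <- predict(fit, newdata = x, neighbors = 3)

head(cbind(rules_only = pred_rules, with_neighbors = pred_neighbors))
```

Because `neighbors` is applied at prediction time, different values can be compared without refitting the model.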
Cubist Model in R
To demonstrate a Cubist model, we will build one in R using the Rossmann Store Sales data set from Kaggle.com. We first load our data, remove unwanted features, and clean up the date values.

    #load data - assumes the pacman package is installed
    pacman::p_load('Cubist','readr','lubridate','dplyr')
    sales <- read_csv('train.csv')

    #clean up date
    sales$Date <- lubridate::date(sales$Date)
    sales$weekOfYear <- lubridate::week(sales$Date)
    sales$quarter <- lubridate::quarter(sales$Date)
    sales$month <- lubridate::month(sales$Date)

    #flag DayOfWeek values 6, 7 and 1 as weekend
    sales <- sales %>% mutate(weekend = ifelse(DayOfWeek %in% c(6,7,1),1,0))

    #determine columns to use (response first, then the predictors)
    sales <- sales[c(4,2,7,9:13)]

    #set response and explanatory variables
    resp <- sales$Sales
    pred <- sales[-1]
Now with the data prepared we can call the cubist function and generate our model.
    #cubist model
    model_tree <- cubist(x = pred, y = resp)
    model_tree
    summary(model_tree)
We can print our model object to get a high-level sense of the framework of the model.

    > model_tree

    Call:
    cubist.default(x = pred, y = resp)

    Number of samples: 66900
    Number of predictors: 7

    Number of committees: 1
    Number of rules: 17
We can see that 17 rules have been found in the data. In other words, there are 17 subsets of the data, each with its own regression model. This is a very rough model, as we did not perform any tuning or take great care in the data preparation stages. We can take a look at a slice of the individual models.
    > summary(model_tree)

    Call:
    cubist.default(x = pred, y = resp)

    Cubist [Release 2.07 GPL Edition]  Tue Nov 05 08:06:16 2019
    ---------------------------------

        Target attribute `outcome'

    Read 66900 cases (8 attributes) from undefined.data

    Model:

      Rule 1: [935 cases, mean 166.2, range 0 to 26756, est err 166.5]

        if
            DayOfWeek > 3
            DayOfWeek <= 4
            SchoolHoliday > 0
            weekOfYear > 50
        then
            outcome = 0

      Rule 2: [9135 cases, mean 206.9, range 0 to 37122, est err 206.9]

        if
            DayOfWeek > 6
        then
            outcome = 224 - 32 DayOfWeek + 116 Promo

      Rule 3: [1069 cases, mean 948.6, range 0 to 32169, est err 428.9]

        if
            DayOfWeek > 4
            DayOfWeek <= 6
            SchoolHoliday > 0
            weekOfYear > 50
        then
            outcome = -28880 + 5776 DayOfWeek
Here are the first 3 rules generated by the model, along with each rule's associated regression model. Each rule lists the conditions (the path through the tree) that lead to its regression model, followed by the specific regression equation used for the cases that meet those conditions. An interesting observation here is that rule 1 does not contain a regression model, only a constant. When the Cubist algorithm evaluates candidate models, it includes a model that uses only the median as one of the possibilities. If the median performs best, that model is used for the rule.
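To make the reading of a rule concrete, the sketch below encodes Rule 2 from the summary as plain R functions. `rule2_applies` and `rule2_predict` are hypothetical helper names, and the input rows are made up for illustration. It mirrors only this single rule; a full Cubist prediction can involve other rules that cover the same case.

```r
# Rule 2 from the summary: if DayOfWeek > 6
# then outcome = 224 - 32 DayOfWeek + 116 Promo
rule2_applies <- function(d) d$DayOfWeek > 6
rule2_predict <- function(d) 224 - 32 * d$DayOfWeek + 116 * d$Promo

# Made-up example rows (not taken from the data set).
days <- data.frame(DayOfWeek = c(7, 7, 3), Promo = c(0, 1, 1))

# Rows the rule does not cover get NA; another rule would handle them.
days$rule2 <- ifelse(rule2_applies(days), rule2_predict(days), NA)
days
```

For a day with DayOfWeek = 7 and no promotion, the equation gives 224 - 32 * 7 = 0, and a promotion adds 116 to that figure. This kind of direct arithmetic is what makes the rules easy to explain.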
We can also take a look at an overall evaluation of the model.
    Evaluation on training data (66900 cases, sampled):

        Average  |error|             1870.7
        Relative |error|               0.57
        Correlation coefficient        0.77

        Attribute usage:
          Conds  Model

          100%    71%    DayOfWeek
           87%    78%    weekOfYear
           67%    46%    Promo
           11%    38%    SchoolHoliday
The average error is the mean absolute error (MAE), that is, the average magnitude of the difference between actual and predicted values; it is not the MSE. The relative error is the ratio of the average error magnitude to the error magnitude that would result from always predicting the mean value, so a value below 1 means the model beats that baseline. The correlation coefficient is the correlation between actual and predicted values, not R-squared (squaring 0.77 gives roughly 0.59). We can also see a breakdown of how the model features are used. The Conds column indicates how often a feature was used in a rule's conditions, and the Model column indicates how often a feature was used in a rule's regression model.
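The three summary statistics can be reproduced by hand in base R. The sketch below uses small made-up vectors rather than the Rossmann data, and the variable names are chosen here for illustration.

```r
# Made-up actual and predicted values (illustration only).
actual    <- c(10, 20, 30, 40)
predicted <- c(12, 18, 33, 37)

# Average |error|: mean absolute error of the predictions.
avg_error <- mean(abs(actual - predicted))

# Error magnitude from always predicting the mean of the response.
baseline_error <- mean(abs(actual - mean(actual)))

# Relative |error|: values below 1 beat the predict-the-mean baseline.
rel_error <- avg_error / baseline_error

# Correlation coefficient between actual and predicted values.
corr <- cor(actual, predicted)

c(avg_error = avg_error, rel_error = rel_error, corr = corr)
```

Here the average error is 2.5 and the baseline error is 10, giving a relative error of 0.25.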
As we have seen, a Cubist model does deliver a balance between interpretability and predictive power. Cubist models are a great tool to add to a data scientist's bag of tricks and can be very useful in business applications. The models are relatively easy to follow, which can be beneficial when convincing business leaders and users of the value the model can bring to the business.