Not Quite a Perfect Model Stack
In model building, the power of the majority can be a great thing. For scholars of democracy, this does not refer to Alexis de Tocqueville’s “tyranny of the majority.” I apologize, as that is probably a poor pun and a bit of a nerdy reference. Applying the power of the majority in machine learning allows a model to combine the outputs of many learners. Model stacking builds on ensemble modeling, which usually combines many weak learners to arrive at a final prediction that improves on any single weak learner. Random forests and gradient boosting machines both employ ensemble methods. Stacking extends these methods: it feeds the predictions of several models into another learning algorithm that combines them. Like any algorithm that seeks to minimize error, the model used to stack the outputs is itself trained to minimize prediction error. In this article, we’ll go over how to make a (not quite) perfect model stack.

[Figure: Perfect Model Stack]

The above image lays out a simple example of a stacked model. The first step is to develop the individual strong learners, which often include models from the Classification and Regression Tree (CART) world, regression models, and artificial and deep neural networks (ANNs and DNNs), to name a few model types. Each of these models then generates a prediction. These predictions are fed into another learning model whose purpose is to discover the weights of these predictions for use in a final function that combines them in a way that meets the objective of the problem (e.g., minimizing RMSE or cross-entropy loss). Within each of the steps above, we can apply the standard approaches to model creation, such as cross-validation, grid search, and training/validation/test splits.
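The three steps described above can be sketched in a few lines of plain Python. This is a minimal toy, not the setup used later in this article: the data is made up, the two base models are one-variable linear fits, the meta-learner is a least-squares weighting with no intercept, and the names fit_line and fit_weights are hypothetical helpers introduced only for illustration.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length lists."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns a predict function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x

def fit_weights(p1, p2, y):
    """Meta-learner: least-squares weights for y ~ w1*p1 + w2*p2."""
    s11 = sum(a * a for a in p1)
    s22 = sum(b * b for b in p2)
    s12 = sum(a * b for a, b in zip(p1, p2))
    s1y = sum(a * t for a, t in zip(p1, y))
    s2y = sum(b * t for b, t in zip(p2, y))
    det = s11 * s22 - s12 * s12  # zero if the two predictions are collinear
    w1 = (s1y * s22 - s2y * s12) / det
    w2 = (s2y * s11 - s1y * s12) / det
    return w1, w2

# Made-up data: the target depends on two features.
x1 = [0, 1, 2, 3, 4, 5, 6, 7]
x2 = [3, 1, 4, 1, 5, 9, 2, 6]
y = [2 * a + 3 * b for a, b in zip(x1, x2)]

# Step 1: fit each base model on the first half of the rows.
model_a = fit_line(x1[:4], y[:4])  # sees only feature x1
model_b = fit_line(x2[:4], y[:4])  # sees only feature x2

# Step 2: generate base predictions on rows the base models did not see.
p1 = [model_a(v) for v in x1[4:]]
p2 = [model_b(v) for v in x2[4:]]
y_hold = y[4:]

# Step 3: the meta-learner learns how to weight the two predictions.
w1, w2 = fit_weights(p1, p2, y_hold)
stacked = [w1 * a + w2 * b for a, b in zip(p1, p2)]

print("RMSE model A:", rmse(y_hold, p1))
print("RMSE model B:", rmse(y_hold, p2))
print("RMSE stacked:", rmse(y_hold, stacked))
```

Note that the stacked error here is measured on the same rows the meta-learner was fit on, so it cannot be worse than either base model and is optimistic. A fair comparison needs a separate test set that neither level has seen.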

A Failed Example

So we have laid out the general process of model stacking. Usually, a successful demo would be shown next, but I think it is useful for learning to see an outcome that fails to meet expectations.

The best gain in performance through stacking is expected to occur when the individual models in the first level of the stack return predictions that are not highly correlated. The advantage is that models that are not highly correlated should better balance their outputs across the different pockets of the data's distribution. To help cement this thought, let's think of a patient visiting a couple of doctors, where the doctors play the role of our models. The patient is complaining of foot cramps and stomach pains, so the patient visits two doctors: the first is a foot doctor and the second is a stomach doctor. The foot doctor should do well at treating the feet. The foot doctor may have some knowledge of the stomach but is unlikely to have as much success in treating it as the stomach doctor would. Now we can bring in a general practitioner who is not a specialist in any area but is knowledgeable in all areas. This general practitioner can take the recommendations of both the stomach doctor and the foot doctor and use them to smooth out any extreme views from either individual. The result is a treatment plan that should be more effective than the treatment of any individual doctor, since it uses the strengths of each. In other words, the foot doctor and the stomach doctor are not correlated in their expertise, and this allows the general practitioner to combine their advice into a better final treatment.
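A short calculation makes the role of correlation concrete. As a simplifying assumption that is not part of the example in this article, suppose two models have prediction errors e_1 and e_2 with zero mean, the same variance sigma^2, and correlation rho, and that we combine them with a plain average. The variance of the averaged error is then:

```latex
\[
\operatorname{Var}\!\left(\frac{e_1 + e_2}{2}\right)
= \frac{1}{4}\left(\sigma^2 + \sigma^2 + 2\rho\sigma^2\right)
= \frac{\sigma^2 (1 + \rho)}{2}
\]
```

With uncorrelated errors (rho = 0) the variance is cut in half. As rho approaches 1 the variance approaches sigma^2, so combining the two models gains almost nothing over using either one alone.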

To show an example of a stacked model that does not improve on its individual models, I trained a random forest model and an extreme gradient boosting model on the Rossmann Store data from Kaggle.com. Using default parameters and a sample of the data, the performance of each individual model on the test data is:

I then used the predictions from each model to train a new gradient boosting model and generate my stacked predictions. The outcome of the final stacked model is:

Right away we see that the stacked model actually does worse than the xgbm model, which has the best R-squared value. One reason may be that the models could have been better tuned or that a larger variety of models could have been used. I believe the largest factor, however, is that the individual models are too similar to create an advantage when combined.

The correlation between the predictions of the individual models is very high. This seems to indicate that each model is picking up on similar artifacts in the data, so neither model contributes much that the other does not.
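Checking this before stacking takes only a few lines. The sketch below computes the Pearson correlation between two sets of predictions in plain Python. The prediction values are invented for illustration and are not the Rossmann results, and pearson is a hypothetical helper name.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented test-set predictions from two first-level models.
pred_rf = [5200.0, 6100.0, 4800.0, 7300.0, 5900.0]
pred_xgb = [5150.0, 6250.0, 4700.0, 7400.0, 6000.0]

print("Correlation of predictions:", pearson(pred_rf, pred_xgb))
```

A value close to 1 suggests the second model adds little new information for the stacking model to exploit.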

Ways to Improve the Stacked Model

How could we improve this? We could try alternative models in the first level of the stack: models that complement each other and are less correlated. We could also explore alternative methods of stacking the models together, such as a simple weighted average, or add additional features at the level where we combine the predictions. We could also try a different model to combine the first-level predictions. It may also be the case that this data set is not well suited to a stacked model. With any machine learning problem, it is necessary to explore multiple models, as each data set is unique.
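As one illustration of the simple weighted average idea, the sketch below searches a small grid of weights and keeps the one with the lowest RMSE on a validation set. The numbers are made up, and best_blend_weight is a hypothetical helper, not a function from any of the packages mentioned in this article.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length lists."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def best_blend_weight(pred_a, pred_b, y_true, steps=20):
    """Grid-search w in [0, 1] for the blend w*pred_a + (1 - w)*pred_b."""
    best_w, best_err = 0.0, float("inf")
    for i in range(steps + 1):
        w = i / steps
        blend = [w * a + (1 - w) * b for a, b in zip(pred_a, pred_b)]
        err = rmse(y_true, blend)
        if err < best_err:
            best_w, best_err = w, err
    return best_w, best_err

# Hypothetical validation-set values, made up for illustration only.
y_valid = [10.0, 12.0, 9.0, 15.0, 11.0]
pred_rf = [11.0, 11.0, 10.0, 13.0, 12.0]
pred_xgb = [9.5, 13.0, 8.0, 16.0, 10.5]

w, err = best_blend_weight(pred_rf, pred_xgb, y_valid)
print("Best weight on the first model:", w)
print("Blended RMSE:", err)
```

Because the grid includes w = 0 and w = 1, the blend can never do worse than the better single model on the validation set. Whether it holds up on unseen test data is a separate question.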

Final Thoughts

As we saw, when applying machine learning to solve a problem, there is usually no single silver bullet that will always be the solution. Each data set is unique, and multiple methods must be tested to arrive at the best one. Some packages and companies have tried to automate this model identification, but even this can fall short of the best approach. Model stacking can be a simple way to improve model performance when the data set allows for it. The MLWave blog has a great post on creating the perfect model stack. There are also several R packages that are well suited to exploring model stacking, including SuperLearner, subsemble, and caretEnsemble. The book Hands-On Machine Learning with R by Brad Boehmke and Brandon Greenwell has a useful chapter on model stacking.

One final note: The data for this article is from the Rossmann Store Sales challenge on Kaggle.com.

Jacey Heuer

Jacey Heuer is a data scientist currently working in the retail and e-commerce industry. He holds master’s degrees from Iowa State University in business and data science. He has analytics experience across many industries, including energy, financial services, and real estate. Jacey is also a data science author who publishes educational content for the PluralSight.com platform. He enjoys reading, writing, and learning about all data science topics. His particular interests are probability theory, machine learning, classical statistics, and their practical application in business.