Statisticians have long known that two heads are better than one, but three is even better. Sir Francis Galton, the polymathic giant behind the pairwise correlation, regression, et al., writes about the surprising power of Vox Populi—“voice of the people” if you’re Roman—to make predictions about unknown quantities superior to that of solo predictions, so long as the decisions are diverse and independent. He highlights a 1907 British competition in which a crowd is asked to estimate the hanging weight of a nearby ox. The average of all guesses — some expert, mostly laymen—was off by less than 1%, beating any individual guess!
The machine learning community exploits this characteristic in a family of learning algorithms called ensembles, which enjoy a storied reputation for outperforming lone-wolf algorithms on Kaggle. In fact, it was an ensemble — predictions from 24 learners blended together by Gradient Boosted Decision Trees — that was named the victor of the Netflix Prize, a competition put on by the media streaming giant to improve the RMSE of their now extant movie recommender, Cinematch, by 10%. The prize was one million dollars.
This article is the first in a four-part series that introduces three popular ensemble methods: bagging, boosting, and stacking. This post threads the concepts and intuition of all three, while the subsequent three posts are a comprehensive dive into each method that will include runnable R code on both a toy binary classification and regression example.
Before proceeding, brush up on the bias/variance trade-off, bootstrapping, and the interpretability/accuracy trade-off if they’re fuzzy.
[Related article: Best Machine Learning Research of 2019]
The Intuition of How it Works
Consider the scenario in which your parents are machines. They only teach you binary, making your communication dependent on sending and receiving long strings of 1’s and 0’s. To English speakers, any two strings look virtually indistinguishable. You, though, you machine you, know better. But trouble arises even for a machine if just a single-digit gets flipped—signal corruption! — from 0 to 1 or vice versa.
But you’re a clever machine-speaking human, so you devise this solution that’s based on the idea that lightning rarely strikes the same spot twice, AKA signal corruption is pretty rare, so if a digit is erroneously flipped during one guess, it’s much, much rarer than an independent guesser guess the same digit incorrectly again, and it’s clear that the risk of absolute (or modal) corruption approaches 0 as you make more guesses from more independent guessers. Here’s the recipe you cook up:
Relay all incoming strings of binary multiple times, interpret each, then take a majority vote of all interpretations as your final answer. Do this for each binary chunk.
For example, say your RoboMa says to you:
Which obviously means,
“Unplug the toaster; I’m done.”
Left to your own devices, you miscode the first 0 as a1, making:
And therefore parse it as:
“Unplug me; I’m done.”
No RoboMa, NOOOOOO!!!! YOU’RE NOT OUTMODED YET!!!!!
But you’re obedient — you’re the offspring of machines, remember? — and oblige without hesitation, unplugging your mom, and live out the rest of your life burdened by the prospect of having possibly misheard what she said, therefore needlessly sending her to an early lifecycling. Such is your motivation to devise a better way to accurately decode binary classification.
So to avoid avoidable death, you decide to use three simple, random binary classifiers, each of which will interpret the same binary string in turn. Each classifier spits out 0’s 80% of the time and 1’s 20% of the time. On the back of an envelope, you do the math to check your hunch:
There are four plausible outcomes with three classifiers:
- 1. All three are right:
- 0.8 x 0.8 x 0.8 = 0.512
- 2. All three are wrong:
- 0.2 x 0.2 x 0.2 = 0.08
- 3. One is right:
- 0.8 x 0.2 x 0.2 +
- 0.2 x 0.2 x 0.8 +
- 0.2 x 0.8 x 0.2 = 0.096
- 4. Two are right:
- 0.8 x 0.8 x 0.2 +
- 0.8 x 0.2 x 0.8 +
- 0.2 x 0.8 x 0.8 = 0.384
The accuracy of any one classifier is 0.8, while the accuracy of the ensemble of all three is: 0.384 + 0.512 = 0.896
The ensemble is 9.6 percentage points more accurate than a solo classifier, even though the ensemble is made up of classifiers that are only 80% accurate.
You’re not alone if this feels magical. Making magic happen requires three assumptions though:
- Learners — our classifiers — are more right than wrong, on average. In our example, 0’s are identified 80% of the time, which being greater than 50%, is copacetic.
- Learners are diverse.I.e., different learners make different errors for the same observation. In our example, learners are random number generators.
- Learners are uncorrelated. ln our example, learners are random number generators, so correlation across all of them is also random.
(Note this example is a more colorful version of this.)
The Major Players
To Bag or To Boost?
- Bagging is a portmanteau for bootstrapped aggregation. It’s the process of getting predictions from many models, each of which has been trained on a different fold, and aggregating them. Each fold is sampled from the total available training data with replacement, which perturbs the overall signal and noise in each fold such that the ratio of each is different for each model. The idea is to reduce model variance by tolerating more model bias. Training fold creation is repeated so each base learner has a training set. Then the base learners are trained in parallel, or at the same time, and predictions are collected from each. In classification— binary or multi class—the output is decided upon by majority vote, with ties broken arbitrarily. In regression, the output is averaged. In both cases, the accuracy is superior to any constituent model.
- Boosting is incrementally fitting many models to reweighted versions of the training data, where the training data is reweighted based on how much the observation contributes to bias. More bias, more weight. This strengthens weak learners into strong learners.
Training in parallel that occurs in bagging aims to capitalize on the independence of each base learner, while the sequential training in boosting capitalizes on the dependence of the learners.
All bagging and boosting techniques will reduce overall generalization variance (because all are wise algorithmic crowds…), but only the boosting techniques try to explicitely reduce bias as well—possibly to a fault, which overfits to the training data—while bagging can actually ameliorate overfit problems by averaging predictions.
The toy illustration from the Kaggle ensembling guide shows this well:
Image source: Kaggle ensembling guide
The black line above separates the dots better than the green one. The green line is the decision boundary from an overfit model that got a little carried away in training. The model that learned the green line is said to have low bias (good), but high variance (not good). This means that the model is predicting close to the Truth Almighty, but its fit would change drastically if we hit redo on the data generating process and re-fit a new model.
By creating a multitude of lines like the green one and taking their average, the model gets closer to the black line, which generalizes to new data better. Combining all the green lines will create a separation boundary identical to the black line.
Since ensembling can be used for both regression and classification problems, how output gets combined depends on whether the output is numeric or nominal. But broadly speaking, predictions can only be combined two ways: every classifier has an equal opinion or the better ones get more say in the final vote. However, there are a number of different ways to cast the final vote.
By far the most common method to combine predictions from base learners is the simple average. The expected error of the averaged predictions should not exceed the averaged error of the base learners. In fact, if the residuals have mean 0 and are uncorrelated, the ensembled error should be less than this average by a factor equal to the number of base learners. In practice, though, this is rare, as the errors are often correlated because the base learners were trained on folds of the same training set. See page 68 in Ensemble Methods, Foundations and Algorithms for a detailed explanation.
Combining predictions from base learners by taking their weighted average simply gives some base learners more say in the final vote. How are weights determined?
The simplest, perhaps most exhaustive approach would be to grid search weight values between 0 and 1 for each ensemble member — or use an optimization procedure such as a linear solver or gradient descent optimization. However, you can get pretty creative with your combination method. Shahhosseini et al have pioneered a technique that I want to explore more here. It’s called Cross-validated Optimal Weighted Ensemble—COWE— and it gives the modeler control over the bias variance trade off by integrating what’s usually a two step process: tuning base learner hyperparameters and choosing weights for the learners.
So to weight or not to weight? While there’s no hard and fast rule, Herbrich and Graepel concede it’s widely accepted to simply average predictions of learners with similar performance while deploying weighted averaging only when they’re not. See page 70 in Ensemble Methods, Foundations and Algorithms for a detailed explanation.
Rather than calculating weights by way of grid search, if we use another meta learner to pick the weights for base learners, you’re using stacking. Stacking isn’t exclusively, or even technically, a combination method but an ensembling method in its own right, like bagging or boosting.
Base learners are combined in various stages via a stacking algorithm to turn a weak learner into a strong one. No fewer than two stages are possible. Stage 0 consists of at least two weak, uncorrelated learners, while the stage 1 combination method decides how to combine their predictions. If the combination method in stage 1 is simple averaging, we’re bagging.
Good to Know
Terminology for ensembling can be confusing; a random forest model is an ensemble model even if you don’t chain it with other models because it uses bagging internally to average the predictions of its trees. But it can also be ensembled with other models. Such an ensemble is often referred to as a heterogeneous ensemble.
- Think of everything as a hyper-parameter.
- How many base models should I use? Tuneable!
- How should I select features in each base model? Tuneable!
- How should I combine predictions? Tuneable!
- There are trade offs to ensembling, namely that accuracy comes at the expense of interpretability and computation expense. The winning algorithm of the Netflix Prize was never even put into production because it was computationally expensive!
- How many base learners should I use? As many as computationally feasible and accuracy approaches an asymptote. More on this in the article on bagging.
- Stacking and blending are different. Stacking and blending are technically different but so similar that sometimes, unfortunately, they’re used interchangeably.
[Related article: Ensemble Models Demystified]
This article summarized the intuition behind two popular ensembling methods, bagging and boosting. The subsequent three articles in this series will detail the theory and application of these, as well as their counterpart, stacking.
- Original 1992 paper by Wolpert on stacked generalization.
- Original 1994 paper by Bleiman on bagging
- Ensemble Methods, Foundations and Algorithms
- Stanford slides from Hastie.
- Understanding the bias variance tradeoff
- How ensemble methods work: bagging, boosting and stacking
- Good visualization of a simple weighted ensemble of logistic regression and RF. With R code