Why Do Tree Ensembles Work? Why Do Tree Ensembles Work?
Ensembles of decision trees (e.g., the random forest and AdaBoost algorithms) are powerful and well-known methods of classification and regression. We... Why Do Tree Ensembles Work?

Ensembles of decision trees (e.g., the random forest and AdaBoost algorithms) are powerful and well-known methods of classification and regression. We will survey work aimed at understanding the statistical properties of decision tree ensembles, with the goal of explaining why they work.

An elementary probabilistic motivation for ensemble methods comes from Hoeffding’s inequality, which describes the likelihood that a sum of independent random variables deviates by a specified amount from its expectation. The inequality implies that a sufficiently large collection of better-than-random classifiers, whose individual predictions are aggregated e.g. by majority rule, can be expected to have very high accuracy, provided their errors are independent. It is then natural to ask how one can encourage classifiers to make independent errors.

[Related article: Understanding the Hoeffding Inequality]

The random forest approach is to use randomness to encourage independence: by training trees on different subsets of the training set, and by injecting some randomness into the training procedure itself, one can construct a collection of decision trees whose majority vote provides far greater accuracy than any single decision tree could.

In the boosting approach, by contrast, one constructs a sequence of (usually short) trees on weighted forms of the training set, and misclassified training examples have their weights increased. Thus the next classifier is encouraged to focus on those training examples which have so far proven difficult to classify correctly. Weighted voting is used to aggregate the predictions.

In terms of the bias-variance decomposition of classifier error, the random forest apparently succeeds by averaging a collection of low-bias, high-variance classifiers (reducing the variance without disturbing the bias too much), whereas boosting – or at least some initial versions – forms a weighted combination of high-bias, low-variance classifiers (decision stumps or similar).

My talk at ODSC East 2019, “Why Do Tree Ensembles Work?” will attempt to shed some light on mysteries that are not quite explained by the preceding paragraphs. Some are as follows. Why can it happen that a boosted ensemble’s generalization performance can continue to improve even after the training error is already zero, and is it possible to overfit by performing too many boosting iterations? If it helps to grow the individual trees to large depth for either classifier type, what are the implications? Very broadly, why do these methods work so well?

[Related article: How Sampling Bias can Ruin Your Model]

To this end, we will explore several more advanced topics around interpretation and explanation of tree ensembles, such as:

  • the role of the margin: boosting can be seen as a margin equalizer, but the margin view does not perfectly align with observed generalization error;
  • a kernel interpretation of random forest: this provides a geometric interpretation of different regimes in the parameter space and the tradeoff between correlation (among the trees) and strength (of the individual trees);
  • a statistical characterization of the boosting procedure; and
  • interpolation (the ability to fit the training set perfectly): the idea that a well-chosen average of interpolating models can be resistant to overfitting; this may serve as a possible explanation for the success of both of these methods.


Learn more at ODSC East 2019 this April 30 to May 3 in my talk, “Why Do Tree Ensembles Work?

Joe Ross

Joe Ross holds a PhD in mathematics from Columbia University and was a researcher and instructor in pure mathematics, most recently at the University of Southern California. He has worked as a data scientist at machine learning/analytics startups for five years; in his current role, he focuses on a variety of time series (anomaly detection, forecasting, correlation) and sampling problems that arise in monitoring. https://www.linkedin.com/in/joseph-ross-6335b297/