In this article, I provide an overview of the statistical learning technique called gradient boosting, and also the popular XGBoost implementation, the darling of Kaggle challenge competitors.
In general, gradient boosting is a supervised machine learning method for classification as well as regression problems. The overarching strategy involves producing a statistical learning model in the form of an ensemble of weak models, normally decision trees.
The roots of boosting methodology date back to 1988, and the conceptual framework was further developed by Jerome Friedman of Stanford, who named it the Gradient Boosting Machine (GBM). His seminal paper, “Greedy Function Approximation: A Gradient Boosting Machine,” is a worthy read to gain a fundamental understanding of the process. R’s implementation, the gbm package, is based on Friedman’s original work.
R’s gbm() algorithm can be characterized in the following ways:
- Considered competitive with other high-performance algorithms like random forests
- Maintains reliable predictive accuracy: it rarely produces lower-quality predictions than simpler models, and it avoids nonsensical predictions
- No limit to the number of predictors
- Handles missing data
- Feature scaling is unnecessary
- Handles more factor levels than random forest (1024 vs. 32)
The learning process for GBMs involves sequentially fitting new models, each one correcting the errors of the ensemble built so far, to provide a more finely tuned estimate of the response variable. The principal idea behind this algorithm is to construct each new base-learner so that it is maximally correlated with the negative gradient of the loss function evaluated for the entire ensemble.
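To make the sequential fitting concrete, here is a minimal sketch in plain Python. It is not the gbm or XGBoost implementation; it assumes a single numeric predictor, squared-error loss (where the negative gradient is simply the residual), and one-split decision stumps as the weak learners. The function names (fit_stump, fit_gbm, predict_gbm, mse) and the toy data are hypothetical and chosen only for illustration.

```python
# Illustrative gradient boosting for regression with squared-error loss.
# Assumptions: one numeric feature, at least two distinct feature values,
# and decision stumps (a single split) as the weak base-learners.

def fit_stump(xs, targets):
    """Find the split threshold and the two leaf means that minimize squared error."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2.0
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_gbm(xs, ys, n_rounds=50, learning_rate=0.1):
    """Build the ensemble one stump at a time."""
    base = sum(ys) / len(ys)            # start from a constant prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For the loss 0.5 * (y - F)^2, the negative gradient with respect
        # to the current prediction F is the residual y - F.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Shrink each stump's contribution by the learning rate.
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict_gbm(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((y - predict_gbm(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Toy data: a smooth curve that no single stump can fit well.
xs = [float(i) for i in range(20)]
ys = [x * x / 10.0 for x in xs]

baseline = fit_gbm(xs, ys, n_rounds=0)   # constant model only
model = fit_gbm(xs, ys, n_rounds=100)    # constant model plus 100 stumps
print("training MSE, constant model:", mse(baseline, xs, ys))
print("training MSE, boosted model: ", mse(model, xs, ys))
```

Each round fits a stump to the current residuals and adds a shrunken copy of it to the ensemble, so the training error falls as rounds accumulate. Production implementations differ in many ways: they use deeper trees, support other loss functions, and add regularization and subsampling.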
XGBoost is a more recent evolution of gradient boosting. Let’s learn more about how XGBoost became king of the hill for data scientists desiring accurate predictions.
The XGBoost Story
More than half of the winning solutions in machine learning challenges hosted at Kaggle have used the popular open-source XGBoost algorithm (eXtreme Gradient BOOSTing). The source code for XGBoost can be found on GitHub.
XGBoost started as a research project by Tianqi Chen, a Ph.D. student at the University of Washington Department of Computer Science and Engineering. It became well known in machine learning competition circles after its use in the winning solution of the Higgs Boson Machine Learning Challenge. Soon after, the Python and R packages were built, and XGBoost now has packages for many other languages, including Julia, Scala, and Java. XGBoost was first released in March 2014.
To get the full story directly from the creator’s perspective, watch the video below from my favorite local (Los Angeles) Meetup group, Data Science LA. In March 2016, Tianqi Chen came to present his creation to a packed house. Chen’s original research paper is “XGBoost: A Scalable Tree Boosting System,” and I highly recommend reading it carefully to get a good perspective on why this algorithm works so well.
Better than Deep Learning: GBM
With all the hype about deep learning and AI, it’s not widely known that for the structured/tabular data commonly encountered in business applications, it is GBM that most often achieves the highest accuracy in supervised learning tasks. The July 2018 video presentation below, also from the Data Science LA Meetup group, discusses some of the main GBM implementations available as R and Python packages, such as XGBoost, h2o, and lightgbm. The discussion covers their main features and characteristics, along with insights into how tuning GBMs and creating ensembles of the best models can achieve the best prediction accuracy for many business problems.
To get a sense of how many of the most widely used binary classification algorithms compare on scalability, speed, and accuracy, there is a GitHub repo containing some rather extensive benchmarks. The author, Szilard Pafka, has some compelling thoughts on deep learning vs. traditional supervised learning algorithms:
“What’s happening now is a new wave of hype, namely deep learning. The fanboys now think deep learning (or as they miscall it: AI) is the best solution to all machine learning problems. While deep learning has been extremely successful indeed on a few classes of data/machine learning problems such as involving images, speech and somewhat text (supervised learning) and games/virtual environments (reinforcement learning), in more “traditional” machine learning problems encountered in business such as fraud detection, credit scoring or churn (with structured/tabular data) deep learning is not as successful and it provides lower accuracy than random forests or gradient boosting machines (GBM).”