Captain America is great in many ways. He is not necessarily the strongest, nor the fastest. He cannot fly or shoot anything out of his hands. However, he is a consistent leader who is well understood, admired, and very effective. Also, when needed, he can even wield Thor’s hammer. So, if you really think about it, XGBoost is machine learning’s Captain America.
[Related Article: Some Details on Running xgboost]
This is a bit of a silly comparison, and many may disagree with it, but it is a fun thought that can help with remembering the components of various algorithms (have a look at the “memory palace” technique). The point is that XGBoost is generally more powerful than a random forest while being more interpretable than a neural net. Every modeling situation calls for its own assessment of the best method to use, but XGBoost is often one of the best models to consider.
Extreme Gradient Boosting, or XGBoost as it is commonly known, is founded on the framework of gradient boosting. The primary goal of the XGBoost algorithm is to enhance the speed and accuracy of boosted models. The authors of the XGBoost documentation state the purpose best: “The goal of this library [XGBoost] is to push the extreme of the computation limits of machines to provide a scalable, portable and accurate library.” In this article, we will discuss the foundations of the XGBoost algorithm. We will understand what gradient boosting is and briefly highlight how XGBoost pushes the performance boundaries of the traditional gradient boosting model.
Refresher on Boosting
Before venturing too far down the path of XGBoost, let us take a moment to define boosting and gradient boosting. Boosting models are a form of ensemble modeling built from decision trees that are formed sequentially. Each sequential tree uses the error of the prior trees as its target and seeks to minimize that error. More concretely, a boosting model begins with some naive model, often the mean of the target variable. The error of each instance in the target vector is calculated, and this error becomes the target variable in the next iteration of the boosted model. The process continues until some number of trees is reached or some level of error is achieved. This is boosting. Gradient boosting takes it a step further: each tree is fit to the gradient of the loss, each tree’s contribution is scaled by a learning rate, and the sample used to build each tree can be subsampled. Boosting models are generally known to be quite accurate in their predictions, but in achieving that accuracy they run the risk of high variance and overfitting.
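The loop described above can be sketched in a few lines of Python. This is a toy illustration, not XGBoost’s actual implementation; shallow scikit-learn trees stand in for the boosted learners, and the data is synthetic:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosted_fit_predict(X, y, n_trees=50, learning_rate=0.1):
    """Toy boosting loop: each tree fits the residual of the running prediction."""
    base = float(np.mean(y))                 # naive starting model: the mean
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residual = y - pred                  # error of the current ensemble
        tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
        pred = pred + learning_rate * tree.predict(X)  # shrink each tree's contribution
        trees.append(tree)
    return base, trees, pred

# tiny demo on synthetic data
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 0.5, size=100)
base, trees, pred = boosted_fit_predict(X, y)
boosted_mse = float(np.mean((y - pred) ** 2))
baseline_mse = float(np.mean((y - np.mean(y)) ** 2))
```

Each iteration chips away at the remaining error, which is why `boosted_mse` comes out well below `baseline_mse` on the training data.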
We can look at a simple example to better understand the gradient boosting framework. For simplicity, we will avoid the concepts of information gain and Gini index used in the selection of decision tree nodes. This example is only concerned with the boosting aspect of the process. We start with a table containing the price of fruit given firmness, shape and weight.
The first step in a boosted tree would be to find the average of our target variable – which is price in this case. The average here is $6.93. We then calculate the error for each instance.
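In code, that first round is just a mean and a subtraction. The article’s original table is not reproduced here, so the prices below are hypothetical stand-ins:

```python
import numpy as np

# hypothetical fruit prices (illustrative values, not the article's table)
price = np.array([4.50, 6.00, 7.25, 8.10, 5.90, 9.80])

baseline = price.mean()       # round-1 prediction: the simple mean
residual = price - baseline   # per-instance error, the next tree's target
```

By construction, the residuals of the mean sum to zero; they are exactly what the next tree tries to explain.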
This error takes the place of our target variable in the next step: we build a tree using the error as the target variable.
Our new tree may look like the following.
Now we add the prediction of the first round, which was the simple mean of our original Price feature, to the error predicted in the second round.
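A self-contained sketch of those two rounds, with hypothetical firmness and price values and a scikit-learn stump standing in for the second-round tree:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# hypothetical features and prices (illustrative values only)
firmness = np.array([[7.0], [3.5], [5.0], [8.2], [4.1], [6.6]])
price = np.array([4.50, 6.00, 7.25, 8.10, 5.90, 9.80])

baseline = price.mean()                      # round 1: the simple mean
residual = price - baseline                  # round-1 errors
stump = DecisionTreeRegressor(max_depth=1).fit(firmness, residual)  # round 2: tree on errors
pred = baseline + stump.predict(firmness)    # mean prediction + predicted error
```

Adding the stump’s residual prediction to the mean can only reduce the training error, which is the whole point of the second round.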
This is a very rough example of the boosted decision tree and in reality, the construction of the tree would be more accurate. However, the effort here is geared towards demonstrating the boosted process.
The boosted process can be simplified to the following:
Prediction = Simple Mean Price + (1st tree error prediction) + … + (Nth tree error prediction)
There are two additional bits that account for the “gradient” component of gradient boosting: for each tree created, we calculate the gradient of the squared error with respect to the current prediction, and we multiply each tree’s contribution by the learning rate.
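Written out in standard notation (not verbatim from the article), the squared-error loss, its negative gradient, and the learning-rate update are:

```latex
L\bigl(y, F(x)\bigr) = \tfrac{1}{2}\bigl(y - F(x)\bigr)^{2},
\qquad
-\frac{\partial L}{\partial F(x)} = y - F(x),
\qquad
F_{m}(x) = F_{m-1}(x) + \eta \, h_{m}(x)
```

So the residual each tree fits is exactly the negative gradient of the squared-error loss, and the learning rate η shrinks each tree’s contribution h_m.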
We have now established both the boosting process and the calculation of the gradient in each tree. Combined, they give us a gradient boosted ensemble of trees.
The XGBoost Framework
With our understanding of gradient boosting, we can take the next step and sort out ways to improve the speed and accuracy of the algorithm. This is the intent of XGBoost. There are several ways in which XGBoost seeks to improve speed and performance. Most of the knobs and buttons used to tune XGBoost are focused on balancing bias and variance. Additionally, we can tune the more common features of any tree algorithm, including tree depth, sample size, number of features, and many other hyper-parameters.
[Related Article: The Best Machine Learning Research of September 2019]
In general, the benefits of XGBoost distill to seven – there are more, but most commentary points to these seven as being the most beneficial. They are regularization, parallel processing, flexibility in setting an objective function, handling of missing values, non-greedy tree pruning, built-in cross-validation, and dropout methods (DART).
We have covered the mechanics of a gradient boosting model. We have started detailing the enhancements that XGBoost brings to gradient boosting models. In my next article, we will walk through a demonstration of the XGBoost model and discuss more of the tuning of the hyper-parameters.