XGBoost Gradient Boosting.
In the first part of this discussion on XGBoost, I set the foundation for understanding the basic components of boosting. In brief, boosting uses sequences of decision trees that seek to reduce the residuals of the prior tree. In other words, each new tree uses the residual of the prior tree as the target variable for the current tree. In doing so, for each new tree, greater focus is given to the larger errors of the prior trees. After enough trees are created, the sum of all tree predictions is calculated to generate the final predicted value.
As I mentioned in my prior post, in general, the benefits of XGBoost are distilled to seven benefits—there are more benefits but most of commentary points to these six as being most beneficial. These are:
- parallel processing
- flexibility in setting an objective function
- handling of missing values
- non-greedy tree pruning
- built-in cross validation and methods of dropout (DART).
In this post we will dig deeper into the methods of tuning the XGBoost algorithm and focus on the a few of the enhancements to GBM that XGBoost intends to provide. A primary advantage of XGBoost over GBM are the greater set of hyper parameters that are able to be tuned. A bit of caution here though is that while these additional tuning parameters are useful there is a bit of a practice needed in developing an understanding of the mix and levels of tuning parameters.
First a brief venture into the background and purpose of XGBoost. XGBoost was originally developed by Tianqi Chen in his paper titeled “XGBoost: A Scalable Tree Boosting System.” XGBoost itself is an enhancement to the gradient boosting algorithm created by Jerome H. Friedman in his paper titled “Greedy Function Approximation: A Gradient Boosting Machine.” Both papers are well worth exploring. The intent of XGBoost is summarized well on the landing page of its documentation website:
XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides a parallel tree boosting (also known as GBDT, GBM) that solve many data science problems in a fast and accurate way. The same code runs on major distributed environment (Hadoop, SGE, MPI) and can solve problems beyond billions of examples.
Since XGBoost is child of GBM, there are many similarities between the algorithms and their tuning parameters. Here is a short list of ways in which both algorithms are similar:
- Offer implementations of Classification and Regression Tree (CART) model framework.
- Inclusion of a form of gradient decent in minimizing the loss function
- Apply a learning rate to the loss function to optimize the path to the minimum loss (this is also known as shrinkage)
- Include base set of hyper parameters such as:
- Minimum sample size of features included in each split
- Minimum sample of size of each leaf
- Max depth of the trees also known as stump size
- Minimum sample size required in each node for splitting
- Number of trees to include
- Learning rate
- The ability to subsample the data for the creation of each tree (this is the stochastic part of stochastic gradient decent)
Three of the Primary Enhancements of XGBoost
Regularization comes in a handful of forms and in general is meant to reduce the potential of overfitting prediction. They also serve as a form of feature selection through the impact that regularization terms have on feature weights in the cost function. In other words, the regularization term reduces noise in the explanatory variables. This takes the form of the sum of features weights multiplied by some regularization term and added to the cost function. There are two common types of regularization.
- L1 (lasso) regularization which will push feature weights to zero and is represented by the alpha parameter. The left image above represent L1 regularization. It can be seen that the red ellipse will intersect the green regularization area at zero on the x-axis.
- L2 (ridge) regularization which will push feature weights asymptotically to zero and is represented by the lambda parameter. The right image above is L2 regularization. Unlike the L1 instance, the red ellipse is able to intersect the green regularization function at some value other than zero on the x-axis. The result is an L2 term that is not expected to ever reach a value of zero.
The use of which form of regularization is generally determined through cross-validation and parameter tuning.
Additionally, there is a gamma parameter which is meant to control the complexity of a given tree. The gamma parameter is used to set the minimum reduction in loss required to make a further split on a leaf node. XGBoost will use the gamma parameter in its pruning steps. It will grow the trees to a specified level and then use the gamma parameter in determining which leaf nodes to remove.
The parallelized development of the decision tree ensembles is another enhancement to the XGBoost algorithm. The algorithm is not growing the decision trees in parallel but rather uses pre-sorting and block storage methods in parallel processing to determine what the optimal split point will be. A linear scan can then be employed across the blocks to determine where the optimal split point is for each feature being tested. XGBoost will make use of a balance between in-memory and out-of-memory processes to make the most of the power the current machine has to offer.
The final enhancement I will mention here is the addition of drop out to the boosting algorithm. Dropout is another flavor of regularization that is intended to reduce overfitting. Dropout is the process of removing randomly selected trees from the decision tree ensemble created in the model building. In the XGBoost algorithm, this process is referred to as Dropout Additive Regression Trees (DART). The percentage of dropout to include is a parameter that can be set in the tuning of the model.
Wrapping Up — XGBoost : Gradient Boosting
The XGBoost algorithm has been a well received tool in the machine learning world. It is widely used and often finds its way to the top of data science competition leaderboards. The algorithm itself, although a bit complex to initially grasp, is easy to understand with a bit of practice. The next step from here would be to continue familiarizing yourself with the knobs and buttons that can be adjusted in the algorithm. When ready, it is also worth moving on and exploring the lightGBM framework of gradient boosting machines.