Getting Up to Speed with XGBoost in R
ModelingXGBoostposted by Daniel Gutierrez, ODSC October 22, 2018 Daniel Gutierrez, ODSC
In this article we’ll take a brief tour of the XGBoost package in R. Previously, I wrote an overview article “Gradient Boosting and XGBoost” which you might wish to read first in order to get some background before digesting the code in this article.
As a quick launch pad for this article, XGBoost is an abbreviation for eXtreme Gradient Boosting. It is a fresh, new implementation of the gradient boosting framework first described by Jerome Friedman of the Stanford University Statistics Department in 2001. XGBoost was first released in 2015 and offers a high level of efficiency and scalability. The package includes an efficient linear model solver and tree learning algorithm. It supports various objective functions including regression, classification, and ranking. The package is extensible in that data scientists easily can define their own objectives. It has become the “go to” tool for many of my data science projects. In this article, we’ll review some R code that demonstrates a typical use of XGBoost.
XGBoost in R
The R code below uses the XGBoost package in R, along with a couple of my other favorite packages. I like using the caret (Classification and Regression Training) ever since I saw its primary author Max Kuhn speak at the 2015 useR! Conference (Max is amazing). I also enjoy using tidyverse ever since I saw its author Hadley Wickham speak at the same conference and also at a couple of local Meetup groups in my hometown of Los Angeles (Hadley also is amazing).
We’ll use the Combined Cycle Power Plant dataset from UCI ML repository. It has 9,568 observations collected from a Combined Cycle Power Plant over 6 years (20062011), when the plant was set to work with a full load, along with 5 continuous variables:

AT = Atmospheric Temperature in C

V = Exhaust Vacuum Speed

AP = Atmospheric Pressure

RH = Relative Humidity

PE = Power Output. This is a response variable whose value we will try to predict given the measurements above.
In order to fully understand how all the tuning parameters work for XGBoost, I strongly suggest you carefully read the research papers referenced in my previous article (the theory and the mathematics are important).
# xgboost with the caret kernal
library(xlsx)
library(tidyverse)
library(xgboost)
library(caret)
# Data set: Combined Cycle Power Plant Data Set from
# UCI ML repository
# Download from:
# https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant
# Read downloaded Excel file
CCPP < read.xlsx2(“Folds5x2_pp.xlsx”,
colClasses=c(rep(“numeric”, 5)), sheetIndex=1)
# Set seed for experimental reproducibility
set.seed(314)
# Create index for testing and training data
n < nrow(CCPP) # Number of observations
ntrain < round(n*0.75) # 75% for training set
tindex < sample(n, ntrain) # Create an index
train_CCPP < CCPP[tindex,] # Create training set
test_CCPP < CCPP[tindex,] # Create test set
# xgb.DMatrix class for xgboost
X_train = xgb.DMatrix(as.matrix(train_CCPP %>% select(PE)))
y_train = train_CCPP$PE # Training set reponse label
X_test = xgb.DMatrix(as.matrix(test_CCPP %>% select(PE)))
y_test = test_CCPP$PE # Test set response label
# Use caret trainControl to control the computational nuances of
# the train function
xgb_trainControl = trainControl(
method = “cv”,
number = 5,
returnData = FALSE
)
# Set up data frame with tuning parameters:
# nrounds – Number of Boosting Iterations
# max_depth – Max Tree Depth
# colsample_bytree – Subsample Ratio of Columns
# eta – Shrinkage
# gamma – Minimum Loss Reduction
# min_child_weight – Minimum Sum of Instance Weight
# subsample – Subsample Percentage
xgb_grid < expand.grid(nrounds = 100,
max_depth = 10,
eta = 0.1,
gamma=0,
colsample_bytree = 0.9,
min_child_weight = 1,
subsample = 1
)
# Use train from caret
# Select xgbTree for using xgboost
set.seed(0)
xgb1 = train(
X_train, y_train,
trControl = xgb_trainControl,
tuneGrid = xgb_grid,
method = “xgbTree”
)
Evaluate Model
Next, we need to evaluate the model using a couple of widely used metrics, Root Mean Square Error (RMSE), and Rsquared. We’ll use the predict function to deliver the predicted values and then from there we calculate the residuals.
# Calculate the root mean square error (RMSE) for test set
pred = predict(xgb1, X_test)
residuals = y_test – pred
RMSE = sqrt(mean(residuals^2))
print(paste0(“RMSE = “, round(RMSE,4)))
# Calculate Rsquared for test set
y_test_mean = mean(y_test)
# Calculate total sum of squares
TSS = sum((y_test – y_test_mean)^2 )
# Calculate residual sum of squares
RSS = sum(residuals^2)
# Calculate Rsquared
R_squared = 1 – (RSS/TSS)
print(paste0(“Rsquared = “, round(R_squared,3)))
Observed vs Predicted Plot
Finally, we can do the typical actual versus predicted plot to visualize the results of the model.
# Plot observed vs. predicted with linear fit
plot(pred, y_test, pch=16, col=”blue”, cex=0.75,
xlab=”Predicted Power Output”,
ylab=”Observed Power Output”,
main=”XGBOOST: Observed vs. Predicted”)
lines(pred,
lm(a~b, data=data.frame(a=y_test, b=pred))$fitted,
lwd=2, col=”red”)
To learn more about GLM, attend the workshop at ODSC west: Training Session: Machine Learning in R Part II