Getting Up to Speed with XGBoost in R Getting Up to Speed with XGBoost in R
In this article we’ll take a brief tour of the XGBoost package in R. Previously, I wrote an overview article “Gradient Boosting and XGBoost”... Getting Up to Speed with XGBoost in R

In this article we’ll take a brief tour of the XGBoost package in R. Previously, I wrote an overview article “Gradient Boosting and XGBoost” which you might wish to read first in order to get some background before digesting the code in this article.


As a quick launch pad for this article, XGBoost is an abbreviation for
eXtreme Gradient Boosting. It is a fresh, new implementation of the gradient boosting framework first described by Jerome Friedman of the Stanford University Statistics Department in 2001. XGBoost was first released in 2015 and offers a high level of efficiency and scalability. The package includes an efficient linear model solver and tree learning algorithm. It supports various objective functions including regression, classification, and ranking. The package is extensible in that data scientists easily can define their own objectives. It has become the “go to” tool for many of my data science projects. In this article, we’ll review some R code that demonstrates a typical use of XGBoost.

 

XGBoost in R

 

The R code below uses the XGBoost package in R, along with a couple of my other favorite packages. I like using the caret (Classification and Regression Training) ever since I saw its primary author Max Kuhn speak at the 2015 useR! Conference (Max is amazing). I also enjoy using tidyverse ever since I saw its author Hadley Wickham speak at the same conference and also at a couple of local Meetup groups in my hometown of Los Angeles (Hadley also is amazing).

 

We’ll use the Combined Cycle Power Plant dataset from UCI ML repository. It has 9,568 observations collected from a Combined Cycle Power Plant over 6 years (2006-2011), when the plant was set to work with a full load, along with 5 continuous variables:

  • AT = Atmospheric Temperature in C

  • V = Exhaust Vacuum Speed

  • AP = Atmospheric Pressure

  • RH = Relative Humidity

  • PE = Power Output. This is a response variable whose value we will try to predict given the measurements above.

In order to fully understand how all the tuning parameters work for XGBoost, I strongly suggest you carefully read the research papers referenced in my previous article (the theory and the mathematics are important).

 

# xgboost with the caret kernal

library(xlsx)

library(tidyverse)

library(xgboost)

library(caret)

 

 

# Data set: Combined Cycle Power Plant Data Set from

# UCI ML repository

# Download from:

# https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant

 

# Read downloaded Excel file

CCPP <- read.xlsx2(“Folds5x2_pp.xlsx”,

colClasses=c(rep(“numeric”, 5)), sheetIndex=1)

 

# Set seed for experimental reproducibility

set.seed(314)  

 

# Create index for testing and training data

n <- nrow(CCPP)  # Number of observations

ntrain <- round(n*0.75)  # 75% for training set

 

tindex <- sample(n, ntrain)   # Create an index

train_CCPP <- CCPP[tindex,]   # Create training set

test_CCPP <- CCPP[-tindex,]   # Create test set

 

 

# xgb.DMatrix class for xgboost

X_train = xgb.DMatrix(as.matrix(train_CCPP %>% select(-PE)))

y_train = train_CCPP$PE  # Training set reponse label

X_test = xgb.DMatrix(as.matrix(test_CCPP %>% select(-PE)))

y_test = test_CCPP$PE  # Test set response label

 

# Use caret trainControl to control the computational nuances of

# the train function

xgb_trainControl = trainControl(

 method = “cv”,

 number = 5,  

 returnData = FALSE

)

 

# Set up data frame with tuning parameters:

# nrounds – Number of Boosting Iterations

# max_depth – Max Tree Depth

# colsample_bytree – Subsample Ratio of Columns

# eta – Shrinkage

# gamma – Minimum Loss Reduction

# min_child_weight – Minimum Sum of Instance Weight

# subsample – Subsample Percentage

xgb_grid <- expand.grid(nrounds = 100,  

                       max_depth = 10,

                       eta = 0.1,

                       gamma=0,

                       colsample_bytree = 0.9,

                       min_child_weight = 1,

                       subsample = 1

)

 

# Use train from caret

# Select xgbTree for using xgboost

set.seed(0)

xgb1 = train(

 X_train, y_train,  

 trControl = xgb_trainControl,

 tuneGrid = xgb_grid,

 method = “xgbTree”

)

 

 

Evaluate Model

 

Next, we need to evaluate the model using a couple of widely used metrics, Root Mean Square Error (RMSE), and R-squared. We’ll use the predict function to deliver the predicted values and then from there we calculate the residuals.

 

# Calculate the root mean square error (RMSE) for test set

pred = predict(xgb1, X_test)

residuals = y_test – pred

RMSE = sqrt(mean(residuals^2))

print(paste0(“RMSE = “, round(RMSE,4)))

 

# Calculate R-squared for test set

y_test_mean = mean(y_test)

# Calculate total sum of squares

TSS =  sum((y_test – y_test_mean)^2 )

# Calculate residual sum of squares

RSS =  sum(residuals^2)

# Calculate R-squared

R_squared  = 1 – (RSS/TSS)

print(paste0(“R-squared = “, round(R_squared,3)))

 

 

Observed vs Predicted Plot

 

Finally, we can do the typical actual versus predicted plot to visualize the results of the model.

 

# Plot observed vs. predicted with linear fit

plot(pred, y_test, pch=16, col=”blue”, cex=0.75,

xlab=”Predicted Power Output”,

ylab=”Observed Power Output”,

main=”XGBOOST: Observed vs. Predicted”)

lines(pred,

     lm(a~b, data=data.frame(a=y_test, b=pred))$fitted,

     lwd=2, col=”red”)

 


To learn more about GLM, attend the workshop at ODSC west: Training Session: Machine Learning in R Part II

Daniel Gutierrez, ODSC

Daniel Gutierrez, ODSC

Daniel D. Gutierrez is a practicing data scientist who’s been working with data long before the field came in vogue. As a technology journalist, he enjoys keeping a pulse on this fast-paced industry. Daniel is also an educator having taught data science, machine learning and R classes at the university level. He has authored four computer industry books on database and data science technology, including his most recent title, “Machine Learning and Data Science: An Introduction to Statistical Learning Methods with R.” Daniel holds a BS in Mathematics and Computer Science from UCLA.