NVIDIA GPUs and Apache Spark, One Step Closer

While RAPIDS started with a Python API focus, many people want to enjoy the same NVIDIA GPU acceleration in Apache Spark; in fact, we have many at NVIDIA. When RAPIDS first launched, we planned to accelerate Apache Spark as well as Dask, and we want to share some major accomplishments from the last couple of months. Apache Spark is a leading big data platform, and our vision is to make NVIDIA GPUs a first-class citizen there. In addition, we want to empower the community to seamlessly choose their accelerator for data processing and machine learning workloads across languages. With GPUs, users can exploit data parallelism through columnar processing instead of the traditional row-based processing originally designed for CPUs.
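As a toy illustration of the columnar idea (plain Python, not the actual RAPIDS implementation): storing each field contiguously turns a per-column scan into a tight loop over a homogeneous array, which is the access pattern GPUs and vectorized CPU code exploit.

```python
# Row-based layout: one record per row; scanning a single field
# touches every record object.
rows = [
    {"price": 10.0, "qty": 2},
    {"price": 12.5, "qty": 1},
    {"price": 7.5, "qty": 4},
]

# Columnar layout: each field stored contiguously. A scan over one
# column is a tight loop over a homogeneous array -- easy to vectorize
# or hand to thousands of GPU threads.
columns = {
    "price": [10.0, 12.5, 7.5],
    "qty": [2, 1, 4],
}

def total_row_based(recs):
    return sum(r["price"] * r["qty"] for r in recs)

def total_columnar(cols):
    return sum(p * q for p, q in zip(cols["price"], cols["qty"]))

print(total_row_based(rows))    # 62.5
print(total_columnar(columns))  # 62.5
```

Both layouts compute the same answer; the difference is memory locality, which is what makes the columnar form GPU-friendly.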

[Related Article: Faster deep learning with GPUs and Theano]

Four Accomplishments

Working with industry leaders in the open source community to lay the groundwork for our vision, we have:

  • submitted a proposal (SPARK-24615) and worked with industry leaders to upstream GPU-aware scheduling. This will enable users to natively schedule Apache Spark jobs on GPUs, and will be available in Apache Spark 3.0.
  • started stage-level resource scheduling (SPARK-27495).
  • contributed a columnar processing proposal (SPARK-27396) that has been approved; we are now making the code changes to ship the feature in Apache Spark 3.0.
  • released the RAPIDS XGBoost4J-Spark package to GPU-accelerate end-to-end XGBoost pipelines on Apache Spark. You can follow the progress here. This will allow:
  1. XGBoost users to have native support for RAPIDS cuDF. You can follow the progress on this effort here.
  2. GPU isolation and assignment on multi-tenant GPU clusters. This has been submitted to the DMLC XGBoost library here.
  3. External memory support in the future for XGBoost training on GPUs. Follow our progress here.
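For context, GPU-aware scheduling surfaces GPUs as a schedulable resource through Spark configuration. A minimal sketch of the Spark 3.0 properties involved (the values and the discovery-script path below are illustrative, not a tested configuration):

```
# spark-defaults.conf (illustrative values)
spark.executor.resource.gpu.amount           1
spark.task.resource.gpu.amount               1
spark.executor.resource.gpu.discoveryScript  /opt/spark/scripts/getGpus.sh
```

The discovery script reports which GPU addresses an executor can see, and Spark then assigns tasks to executors with the requested GPU count.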

As always, our goal is to upstream these modifications into the appropriate libraries, build more bridges, and contribute to standards in the big data ecosystem.

Why Did We Start with XGBoost?

XGBoost is a popular gradient boosting library that solves many data science problems in a fast and accurate way. XGBoost4J-Spark is a leading mechanism for enterprises to conduct distributed machine learning training and inference: it enables XGBoost to train on, and run inference over, data distributed across Apache Spark nodes.

What’s in the Release?

We are pleased to release RAPIDS XGBoost4J-Spark to accelerate end-to-end XGBoost pipelines on Apache Spark. With a single-line code change, an existing Spark XGBoost application can leverage GPUs for Spark data loading, XGBoost data conversion, and model training and inference.

Built upon the latest DMLC XGBoost code, RAPIDS XGBoost4J-Spark will contribute the following performance enhancements to the project:

  1. Enable Apache Spark 2.x clusters (YARN, Kubernetes, Standalone) to execute end-to-end XGBoost tasks on GPUs. This is early access to our GPU-aware scheduler. You can follow our progress here.
  2. Load datasets from a variety of sources (HDFS, cloud storage) into a RAPIDS cuDF DataFrame using our GPU-accelerated Apache Spark reader API. This is a sneak peek into our columnar processing SPIP.
  3. Convert RAPIDS cuDF DataFrame directly into XGBoost’s native DMatrix for GPU-based training and inference.
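In pseudocode, the three steps above describe a pipeline like the following (names are borrowed from the RAPIDS Python stack as an analogy to the Spark/Scala flow; this is a sketch that assumes a GPU host with cuDF and GPU-enabled XGBoost, not runnable as-is):

```
# pseudocode sketch of the GPU pipeline
df      = cudf.read_csv("hdfs://.../train.csv")        # 1. load straight into GPU memory
dmatrix = xgboost.DMatrix(df)                          # 2. convert on the GPU, no host round-trip
model   = xgboost.train({"tree_method": "gpu_hist"},   # 3. train on the GPU
                        dmatrix)
```

The point of the design is that the data never leaves GPU memory between loading, conversion, and training.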

This release of RAPIDS Spark XGBoost provides data scientists and data engineers early access, and allows us to gather valuable feedback from the community. Eventually, all these enhancements will be merged into the Apache Spark and DMLC XGBoost projects. We are excited that the GPU-aware scheduler enhancement has been accepted and merged into the Apache Spark repository for the Apache Spark 3.0 release. GPU DataFrame support is being added through the Spark proposal here.

Benchmarking RAPIDS XGBoost4J-Spark

XGBoost has integrated support for running across multiple GPUs, which can deliver even more significant improvements. Benchmarking XGBoost on a YARN cluster, we see an 8x performance improvement comparing 2 NVIDIA T4 GPUs to 50 vCPUs with 300GB of RAM. We use the RAPIDS cuDF reader to load CSV data directly into GPU memory, then perform a zero-copy conversion within GPU memory from a cuDF DataFrame to XGBoost's DMatrix format. This eliminates host-to-GPU memory transfers. Finally, we train the XGBoost model on the GPU.

How Do I Download and Use Distributed XGBoost4J?

Before you get started, please review our list of prerequisites for using the XGBoost4J-Spark on GPU package.

Hardware Requirements

  • NVIDIA Pascal™ GPU architecture or better
  • Multi-node clusters with homogeneous GPU configuration

Software Requirements

  • Ubuntu 16.04/Ubuntu 18.04/CentOS 7
  • NVIDIA driver 410.48+
  • CUDA 9.2/10.0
  • NCCL 2.4.7
  • Spark 2.3.x/2.4.x

To get started with distributed in-memory training across multiple nodes and multiple GPUs using popular Spark cluster managers such as YARN, Kubernetes, and Standalone, refer to the documentation links below for instructions.

We have made GPU-accelerated XGBoost artifacts available in a public Maven repository. You simply need to add the following dependency to your pom.xml. The classifier lets you select the JAR for your OS and NVIDIA CUDA version.


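The dependency snippet did not survive formatting here; the shape of such a Maven dependency is sketched below. The groupId, artifactId, version, and classifier values are placeholders, not the real coordinates; consult the RAPIDS release notes for the exact ones.

```xml
<!-- Illustrative only: substitute the coordinates from the RAPIDS release notes -->
<dependency>
  <groupId>ai.rapids</groupId>               <!-- placeholder groupId -->
  <artifactId>xgboost4j-spark</artifactId>   <!-- placeholder artifactId -->
  <version>X.Y.Z</version>                   <!-- placeholder version -->
  <classifier>linux-cuda10</classifier>      <!-- selects the OS / CUDA build -->
</dependency>
```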
To see how XGBoost integrates with cuDF and Apache Spark, check out these sample applications and notebooks. The Mortgage example notebook applies Fannie Mae data to predict mortgage delinquency (a classification problem). Please follow our README instructions to build the sample Spark XGBoost applications, either via spark-submit or using interactive notebooks.

[Related Article: Technical preview: Native GPU programming with CUDAnative.jl]

Give Us Your Feedback

Given that this is a beta release, please help us by providing feedback on what you like and what you want to see in future releases so we can adjust our roadmap accordingly. Submit issues and join the conversation!

Originally Posted on Medium.com by Karthikeyan Rajendran