RAPIDS 0.7: Well We’re Moving On Up…

RAPIDS 0.7 is live! Like the Jeffersons, RAPIDS is improving in many ways. RAPIDS is available more places than ever before, and XGBoost is now easier to use on multiple GPUs. So much to talk about, so let’s just jump in.

Big Gains in XGBoost

XGBoost is easier to use on multiple GPUs than ever before due to improvements in RAPIDS dask-cudf and dask-xgboost. We appreciate the community engagement so far, and we hope that as we introduce better documentation, more data scientists will be able to take advantage of these powerful tools. As with all of RAPIDS, these libraries are available as conda packages and in our containers on NGC and DockerHub. As a start, we’ve put together 10 Minutes to Dask-XGBoost, the latest in a series of notebook docs intended to get you started with RAPIDS basics in ten minutes. It’s also worth pointing out that XGBoost now supports a wide array of objective functions.

Just to make sure there is no confusion, all RAPIDS modifications and improvements to XGBoost will be upstreamed into DMLC/XGBoost. While RAPIDS provides conda packages for XGBoost, this is only for early adopter convenience. NVIDIA believes in supporting and contributing back to the XGBoost library and many open source projects. Upstreaming the Dask-XGBoost features will take time; you can follow this PR to track the progress.

RAPIDS in the Clouds (… in the big leagues!)

Making it easy for people to use RAPIDS is always our first priority. This starts with training and evangelism. Google Cloud Platform (GCP) has helped tremendously with getting the word out about RAPIDS, and I want to thank GCP for generously donating compute instances to us during GTC San Jose. Thanks to GCP, a standing-room-only crowd was able to use RAPIDS to build models and predict demand on Black Friday. This was just the beginning of great things to come.

Google Colab, a hosted, Jupyter Notebook-like service, recently began offering NVIDIA T4 GPUs. This allowed us to integrate RAPIDS into Google Colab, and now you can try RAPIDS for free! This blog will show you how easy it is. Google Colab is a great place to try out the complete RAPIDS suite and experiment with single-GPU jobs.

Finally, another Google milestone: last week Google Dataproc announced RAPIDS integration as a new initialization action during their Cloud OnAir webinar, “New open-source tools in Cloud Dataproc.” Currently in early beta, the integration lets Dataproc users easily and quickly configure an NVIDIA GPU cluster in the Google environment to try RAPIDS at scale. With Dataproc, you can leverage RAPIDS with Dask-XGBoost to use multiple GPUs and clusters of GPU nodes to train on larger problems. Stay tuned for a detailed blog explaining how to get started.

In summary, RAPIDS is in the clouds: now on Azure Databricks, and throughout Google platforms including GCP DL VM, Colab, Dataproc, and Kubeflow. In addition, for AWS and other clouds that are NGC-Ready, you can use the NGC container to quickly try RAPIDS. So many places!

New Features and Improvements (… in usability and feature completeness!)

Version 0.7 of the RAPIDS cuDF library makes a bunch of things easier. There are now cumulative sum, product, min, and max functions for Series. In fact, we did a complete overhaul of reduction operations on Series on the C++ side. This fixed several bugs, added null support, improved datatype flexibility, and increased aggregation coverage for libcudf. We also added DataFrame.pop(), a great way to get a label column and a data matrix in one line. For reshaping your data, we’ve added multi-index support, including join and groupby functionality (including on strings, which is truly amazing), and the DataFrame.melt() method. See our cheatsheet for more info on melt. Min and max methods now support datetime columns. One more improvement worth mentioning: more cuDF functions now support null/NA data. cuDF gets better every release, and we’re looking forward to release 0.8. A couple of things to get excited about are rolling window functionality and a GPU-accelerated to_csv() function.
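To make the new Series and DataFrame methods concrete, here is a minimal sketch of the cumulative functions, pop(), and melt(). It is written against pandas so it runs on any machine; cuDF follows the pandas API, so on a GPU box swapping the import for cudf is the intended change, but treat exact parity in 0.7 as an assumption rather than a guarantee. The column names and values are made up for illustration.

```python
import pandas as pd  # assumption: on a GPU machine, `import cudf as pd` is the intended swap

# Hypothetical toy data: an id, two feature columns, and a label column.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "x": [3.0, 1.0, 2.0],
    "y": [10.0, 20.0, 30.0],
    "label": [0, 1, 0],
})

# Cumulative operations on a Series.
running_sum = df["x"].cumsum()
running_prod = df["x"].cumprod()
running_min = df["x"].cummin()
running_max = df["x"].cummax()

# pop() removes the label column and returns it,
# leaving the remaining columns behind as the data matrix.
labels = df.pop("label")

# melt() reshapes wide data to long: one row per (id, variable) pair.
long_df = df.melt(id_vars=["id"], value_vars=["x", "y"])
```

After the pop() call, df holds only id, x, and y, while labels holds the column that was removed, which is the "label column and data matrix in one line" pattern.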

In the RAPIDS cuML library, we’ve added two new methods on the Python side. One is brand new: a coordinate descent solver to fit lasso and elastic net regressions. The other is a big improvement: a completely rewritten single-GPU version of k-means built entirely on our machine learning primitives. Under the hood, we’ve done a lot to improve the code and have added C++ methods, like quasi-Newton solvers and random forests, that will be exposed in Python in a later release.
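If coordinate descent is new to you, the idea fits in a few lines. The sketch below is a plain-Python, CPU-only illustration of cyclic coordinate descent for the lasso objective (1/(2n))·||y − Xw||² + alpha·||w||₁. It is not cuML’s implementation, it skips the elastic net term, and the function names and toy data are made up for this example.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_sweeps=200):
    """Minimize (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 one weight at a time."""
    n_samples, n_features = len(X), len(X[0])
    weights = [0.0] * n_features
    residual = list(y)  # equals y - Xw while all weights are zero
    for _ in range(n_sweeps):
        for j in range(n_features):
            column = [row[j] for row in X]
            scale = sum(x * x for x in column) / n_samples
            if scale == 0.0:
                continue  # an all-zero feature cannot reduce the loss
            # Correlation of feature j with the residual, after adding
            # feature j's own current contribution back in.
            rho = sum(
                x * (r + weights[j] * x) for x, r in zip(column, residual)
            ) / n_samples
            new_weight = soft_threshold(rho, alpha) / scale
            delta = new_weight - weights[j]
            if delta != 0.0:
                # Keep the residual in sync with the updated weight.
                residual = [r - delta * x for r, x in zip(residual, column)]
                weights[j] = new_weight
    return weights


# Hypothetical toy problem: two orthogonal features, true weights [2.0, 0.1].
X = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
y = [2.0, 0.1, -2.0, -0.1]
weights = lasso_coordinate_descent(X, y, alpha=0.1)
```

On this toy problem the L1 penalty shrinks the large weight from 2.0 to about 1.8 and sets the small one exactly to zero, which is the sparsity that makes lasso attractive.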

cuGraph continues to improve and refine its code base with an eye towards matching the NetworkX API. Read the latest blog from the cuGraph team here. New analytics for version 0.7 are (1) enhancements to Jaccard Similarity to allow comparison between any vertex pairs; (2) the addition of the Overlap Coefficient as an alternative to Jaccard; (3) Triangle Counting; (4) Subgraph Extraction; and (5) Renumbering.
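For readers who haven’t met these graph metrics before, here is a small pure-Python sketch of what Jaccard Similarity, the Overlap Coefficient, and Triangle Counting compute. It illustrates the definitions only and is not the cuGraph API; the graph and helper names are hypothetical.

```python
from itertools import combinations

# Hypothetical toy graph: undirected edges between integer vertex ids.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]


def build_adjacency(edge_list):
    """Map each vertex to the set of its neighbors."""
    adjacency = {}
    for u, v in edge_list:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def jaccard(adjacency, u, v):
    """Shared neighbors divided by all neighbors of u or v."""
    a, b = adjacency[u], adjacency[v]
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap(adjacency, u, v):
    """Shared neighbors divided by the smaller neighborhood size."""
    a, b = adjacency[u], adjacency[v]
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def count_triangles(adjacency):
    """Count unordered vertex triples that are mutually connected."""
    total = 0
    for u, v, w in combinations(sorted(adjacency), 3):
        if v in adjacency[u] and w in adjacency[u] and w in adjacency[v]:
            total += 1
    return total


adjacency = build_adjacency(edges)
# Vertices 0 and 3 share no edge, but the scores are still defined for the pair.
jaccard_0_3 = jaccard(adjacency, 0, 3)
overlap_0_3 = overlap(adjacency, 0, 3)
triangles = count_triangles(adjacency)
```

Vertices 0 and 3 are not directly connected, yet both scores are well defined for them, which is the point of allowing comparison between any vertex pairs. The brute-force triangle count is fine for a toy graph but scales cubically with the number of vertices.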

Finally, you asked for it, we did it: better error handling! This required rewriting the Python bindings to the underlying libcudf library to use Cython, so now low-level errors are cleanly passed through to the end user for better diagnostics. Version 0.7 took the first steps toward fully rewriting the bindings, and we’ll smooth out any resulting issues in subsequent releases.

The Rule of Two (again)

In my last blog, I talked about the “rule of two” when it came to CUDA version support in RAPIDS. Beginning with version 0.7, we are instituting a new rule of two. We will only be supporting two installation formats for the foreseeable future: conda and source installation. After much thought and deliberation, we will not be supporting pip. For more details on why we made this decision, please see this blog.

Getting Started Has Never Been Easier (… Have Your Piece of the Pie)

With our 0.7 release and in support of our growing community, we’d like to share our new Notebooks Extended repo on GitHub. Read this blog to learn more. You can think of this as the RAPIDS community’s notebooks: a place for data practitioners to grow their skills and teach others what they’ve learned. We expect that Notebooks Extended will be the go-to place for the latest tips and tricks, and where budding RAPIDS practitioners grow and master RAPIDS.

If you’re interested in cybersecurity use cases, you owe it to yourself to read the recent blog post. And if you’re new to cuDF and Dask-cuDF, the blog walks you through everything from what those packages do at the highest level to building an optimized ETL pipeline for large data sets.

RAPIDS in the Financial News!

We were excited to see that RAPIDS was a key part of an NVIDIA effort to blow away the previous best score on the STAC-A3 benchmark, a measure of backtesting performance, which is critical in the financial services sector. To learn more about how RAPIDS is accelerating Python in banking, join us for a webinar on 6/13 with RAPIDS Senior Data Scientist @realpaulmahler.

Deep Learning with Tabular Data — RAPIDS with PyTorch

RAPIDS is making it possible to work with tabular data in deep learning. In a recent post, we explore how traditional machine learning approaches like XGBoost compare in performance against deep neural networks (DNNs). This first foray into deep learning for RAPIDS is a significant step: it demonstrates that a reasonably simple DNN can achieve performance similar to XGBoost.

Looking to 0.8 and Beyond

In 0.8, we’re working hard to release a single-GPU implementation of random forests, and we’re laying the groundwork for multi-node, multi-GPU k-means and random forests in 0.9. We’ve also been working with the OpenUCX community to integrate UCX into Dask. This is coming along very well, and we should have our first version of this out in 0.8 with more optimizations and support in 0.9 and 0.10.

As Long As We Live, It’s You and Me Baby

If you’ve been thinking about trying out RAPIDS, you can get started on Google Colab in seconds. For returning users, there are so many ways to try the latest release: the docs are improved, and there are numerous getting-started notebooks to showcase the many new features in RAPIDS. We’re excited for you to join the community. If you like RAPIDS, please give it a star on GitHub, and file GitHub issues for problems or feature requests to make it even better. See y’all in 6 weeks!

[Originally posted here by Josh Patterson (@datametrician)]