Scaling LightGBM with Dask
LightGBM is an open-source framework for solving supervised learning problems with gradient-boosted decision trees (GBDTs). It ships with built-in support for distributed training, which just means “using multiple machines at the same time to train a model”. Distributed training can allow you to train on larger...
Getting Started with Dask and SQL
Lots of people talk about “democratizing” data science and machine learning. What could be more democratic — in the sense of widely accessible — than SQL, PyData, and scaling data science to larger datasets and models? Dask is rapidly becoming a go-to technology for scalable computing....
Dask in the Cloud
When doing data science and/or machine learning, it is becoming increasingly common to need to scale up your analyses to larger datasets. When working in Python and the PyData ecosystem, Dask is a popular tool for doing so. There are many reasons for this, one being...
Coiled: Dask for Everyone, Everywhere
Data scientists increasingly solve large machine learning and data problems with Python. But historically Python struggled with parallel computing. This led many of us in the community to make Dask, a library for parallel computing and data science for Python. Dask has been a go-to solution...
Parallelizing Custom CuPy Kernels with Dask
Some time ago, Matthew Rocklin wrote a post on Numba Stencils with Dask, demonstrating how to use them for both CPUs and GPUs. This post will present a similar approach to writing custom code, this time with user-defined custom kernels in CuPy. The motivation for this post comes from...
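The pattern that post builds on is applying a user-defined kernel to each chunk of a Dask array with `map_blocks`. The post itself uses custom CuPy kernels on GPU chunks; this sketch substitutes a plain NumPy function so it runs without a GPU, but the Dask side of the pattern is the same.

```python
# Apply a custom per-chunk "kernel" across a Dask array with map_blocks.
# NumPy stands in here for the CuPy user-defined kernels the post uses,
# so this sketch runs on CPU.
import numpy as np
import dask.array as da

def my_kernel(block):
    # A toy elementwise kernel: clip each value to [0, 1], then square it.
    return np.clip(block, 0.0, 1.0) ** 2

x = da.random.random((2_000, 2_000), chunks=(500, 500))
y = x.map_blocks(my_kernel, dtype=x.dtype)  # lazy; applied chunk by chunk
result = y.compute()
```

Because `map_blocks` only cares that the function maps a chunk to a chunk, swapping the NumPy body for a CuPy kernel (with CuPy-backed chunks) parallelizes the same custom code across GPUs.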
10 Minutes to cuDF and Dask cuDF
Centered around Apache Arrow DataFrames on the GPU, RAPIDS is designed to enable end-to-end data science and analytics on GPUs. Together, open source libraries like RAPIDS cuDF and Dask let users process tabular data on GPUs at scale with a familiar, pandas-like API. With Dask, anything...