Big Data Pipelines: Past, Present, and Future

When data scientists convert research work to production machine learning tasks, a key challenge is how to schedule and organize computations so that results are reliably available and consistently accurate. This is the impetus for the world of tools that offer job scheduling and pipelining functionality. However, many of these tools (widely used ones!) have critical flaws that make life more difficult for data scientists. At Saturn Cloud, we are committed to making exceptionally fast, user-friendly, scalable computing available to any data scientist or machine learning practitioner, including offering the best big data pipelining toolkit available.

Processing Data on a Schedule

But let’s step back a moment. Why are we talking about scheduling data science tasks? What functionalities do we need for this to be productive and sustainable?

A good data pipeline needs to process data in a specific manner at a particular time. However, this hides a lot of possible pitfalls, including:

  • What if the source data is missing?
  • What if a processing step fails?
  • What if a server is down?
  • How do we know a step is complete?
  • How do we tell a step to run at a given time?

It’s not just as simple as clicking one button, at least not under the hood. And this is just for data pipelines with what we would call “small” data. When you escalate to data that is so large that it can’t fit on one machine, or computations with massive complexity and resource needs, even more problems arise! If you have two or more tasks to run, all of which would require most of the computational resources at your disposal, what do you do?

History of Data Processing/Scheduling Tools

It’s interesting to take a step back and look at what our predecessors in data science did. There is a significant history to data pipelines and data processing tools, with many lessons to teach us.

If we go back to the very beginning, we can trace the first rudimentary options for scheduling computation tasks to the 1970s: Unix, cron, and make. Fast forward to the contemporary computing universe, and we have a profusion of new tools.

In terms of “big” data engineering, Hadoop (2006, Yahoo) is the groundbreaking framework, followed by such well known names as Hive (2009, Facebook), Spark (2010, UC Berkeley), Luigi (2012, Spotify), and Airflow (2014, Airbnb). On the other side of the data science ecosystem we find the PyData world, including NumPy (2006), pandas (2008), Xarray (2014), and Arrow (2016), among other tools for data structuring, manipulation, and storage.

All these tools learned from their predecessors, adding functionality and improving features, bringing us to the intersection of Dask (2015) and Prefect (2017), which combine the ability to easily handle enormous volumes of data with smooth compatibility with the PyData ecosystem.

Advances in Dask and Prefect

What makes Dask and Prefect revolutionary tools that deserve a closer look? Starting with Dask: until recently, scaling up to data volumes in the “big data” realm required setting aside the familiar paradigm of Python in favor of learning entirely new tools and languages. This created an artificial barrier between data science and big data engineering, and Dask breaks down this barrier. Now data scientists can use the same pandas, scikit-learn, and NumPy syntax they already know for distributed computation and dramatic scaling of machine learning. The step from your local laptop to a multi-machine computing cluster is no longer like climbing Mount Everest; it’s more like walking up a staircase.
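
As a minimal sketch of what that looks like in practice, the snippet below uses Dask’s pandas-style DataFrame API. The file path and column names (“data/*.csv”, “category”, “amount”) are placeholders for your own data, not anything from a real project:

```python
import dask.dataframe as dd
from dask.distributed import Client

# Start a local Dask cluster; point Client at a remote scheduler address to scale out
client = Client()

# "data/*.csv" is a placeholder path; Dask treats the matching files as one logical dataframe
df = dd.read_csv("data/*.csv")

# The same groupby/aggregation syntax you would write in pandas,
# evaluated lazily and in parallel across partitions
result = df.groupby("category")["amount"].mean().compute()
print(result)
```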

How does Prefect enter this picture? Many readers may have experience with Airflow, an incredibly powerful tool for scheduling and running computation tasks – and Airflow is still sometimes the appropriate tool for very high-complexity use cases. But Prefect brings the bulk of Airflow’s power to a much more user-friendly and accessible place. It was in fact designed by an Airflow developer, lets a new user become productive considerably faster than Airflow does, and supports Dask for parallel task execution. Airflow, despite its full-featured nature, has a steep learning curve, and Prefect takes away much of that struggle.
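
As a rough illustration (using the Prefect 1.x “Core” API, the current release at the time of writing), the sketch below defines a small ETL-style flow with a retrying task and runs it on Dask; the task names and data are made up for the example:

```python
from datetime import timedelta

from prefect import task, Flow
from prefect.executors import DaskExecutor

@task(max_retries=2, retry_delay=timedelta(minutes=1))
def extract():
    # Placeholder data source
    return [1, 2, 3, 4]

@task
def transform(x):
    return x * 10

@task
def load(results):
    print(f"Loaded: {results}")

with Flow("etl-example") as flow:
    data = extract()
    transformed = transform.map(data)  # fan out: one task run per element
    load(transformed)

# Independent task runs execute in parallel on Dask (a temporary local cluster by default)
flow.run(executor=DaskExecutor())
```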

Additionally, Prefect’s tight integration with Dask and its flexible API make it extremely easy to use with RAPIDS, which adds the capability to accelerate, scale, and orchestrate data science workflows on NVIDIA GPUs. For more details, here’s a blog covering the topic.
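
To give a flavor of that combination, here is a minimal sketch that swaps the CPU-backed Dask DataFrame for a GPU-backed one via dask_cudf. It assumes a machine with NVIDIA GPUs and the RAPIDS libraries installed, and the file path and column names are placeholders:

```python
import dask_cudf
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One Dask worker per local NVIDIA GPU (requires RAPIDS and dask-cuda installed)
cluster = LocalCUDACluster()
client = Client(cluster)

# Same dataframe syntax as before, but each partition lives in GPU memory as a cuDF frame
gdf = dask_cudf.read_csv("data/*.csv")
summary = gdf.groupby("category")["amount"].mean().compute()
print(summary)
```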

Challenges of Spark and Airflow

Let’s take a closer look at a handful of common challenges that users of Spark and Airflow run into, and at how Dask and Prefect solve them.

Spark

| Spark | Dask |
| --- | --- |
| Users have Python experience, but not Spark | Users can write Python and get native parallelization |
| Data structures need multidimensional arrays, which Spark doesn’t support | Multidimensional arrays (Dask Arrays) and Dask DataFrames are first-class objects in Dask |
| Existing codebase is Python, and migration is a huge challenge | No need to massively revise existing Python, because Dask uses the same API conventions |
| Machine learning-related computation is slow in Spark | Dask is designed with machine learning and grid search in mind |
| The Java backend means stack traces and error messages are unfamiliar | Dask errors and stack traces are Python-based, so they are easier to interpret and respond to |
| Resource monitoring is not user-friendly in Spark | Dask has a built-in, robust dashboard that allows up-to-the-minute job and resource monitoring (see the snippet after this table) |
| GPU support is limited to data frame ETL and XGBoost | Dask provides end-to-end GPU support (ML, ETL, streaming, graph analytics, array libraries, DL, scientific computing) |
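
On the monitoring point, the dashboard comes along for free whenever you create a Dask client; a minimal sketch:

```python
from dask.distributed import Client

# Start a local cluster with default settings
client = Client()

# URL of the live dashboard: task stream, progress, worker memory and CPU
print(client.dashboard_link)
```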

Airflow

| Airflow | Prefect |
| --- | --- |
| Difficult for inexperienced users to get started | Simpler local startup, a cloud option that handles infrastructure, and an option for paid support |
| Certain scheduling options are difficult or impossible (concurrent runs; understanding that “execution_date” refers to the prior day’s run) | Workflows need no schedule at all and can be run entirely ad hoc, on the same or different data each time |
| Passing parameters into and around DAGs is a known challenge (see: XComs) | Prefect makes arguments to workflows and tasks a first-class feature, with built-in handling for them (sketched below) |
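
To illustrate that last row (again using the Prefect 1.x API, with made-up dataset names), parameters are passed directly into a flow rather than shuttled around through something like XComs:

```python
from prefect import Flow, Parameter, task

@task
def process(dataset_name):
    print(f"Processing {dataset_name}")

with Flow("parameterized-flow") as flow:
    # Parameters are first-class tasks; the default value here is hypothetical
    dataset = Parameter("dataset", default="sales.csv")
    process(dataset)

# Run ad hoc with a different input; no scheduler or XCom-style plumbing required
flow.run(parameters={"dataset": "returns.csv"})
```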

Saturn Cloud 

Saturn Cloud is proud to offer a platform that is explicitly designed to make all the advantages of Dask and Prefect available to even inexperienced users, along with easy and convenient access to multi-machine computing clusters. Data scientists who have experienced even a few of the struggles of Airflow and/or Spark will find that Saturn Cloud infrastructure gives them access to incredibly fast, reliable, and easy machine learning at any scale. We have a free trial and robust open source documentation to help you see for yourself how much this can change your working experience for the better.

“Creating, debugging, and maintaining a distributed Dask cluster can be a painful challenge to overcome. Saturn solves these issues, enabling a Python-based Supercomputer that can easily handle your most compute-intensive workloads”

– Seth Weisberg, Principal Machine Learning Scientist at Senseye

To see an example of what’s possible, readers might enjoy this post on how to use PyTorch and Dask on Saturn Cloud to run image classification inference with the popular ResNet50 deep learning model on an NVIDIA GPU cluster. The tutorial demonstrates how to classify over 20,000 images in roughly five minutes by shifting from single-node serial processing to multi-node, multi-GPU computing, without ever needing to touch a Kubernetes installation or write a line of code outside Python/PyData.
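
The sketch below is not the tutorial’s code, just a rough illustration of the idea under hypothetical assumptions (a reachable Dask scheduler address, and random tensors standing in for preprocessed images): each Dask task loads ResNet50 and classifies one batch, so batches run in parallel across the cluster’s GPU workers.

```python
import torch
import torchvision.models as models
from dask.distributed import Client

# Connect to a running Dask cluster; the scheduler address is hypothetical
client = Client("tcp://dask-scheduler:8786")

def classify_batch(batch):
    """Run ResNet50 on one batch of preprocessed image tensors."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = models.resnet50(pretrained=True).to(device).eval()
    with torch.no_grad():
        logits = model(torch.stack(batch).to(device))
    return logits.argmax(dim=1).cpu().tolist()

# Stand-in for real, preprocessed 224x224 images split into batches
batches = [[torch.randn(3, 224, 224) for _ in range(8)] for _ in range(4)]

# One task per batch, fanned out across the workers
futures = client.map(classify_batch, batches)
predictions = client.gather(futures)
```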

“Speed is essential for enabling data scientists to analyze the vast and growing amounts of information created every day. With RAPIDS and NVIDIA accelerated computing on Saturn Cloud Hosted, data scientists can access the tools and performance they need to do their best work.”

– Joshua Patterson, Senior Director of RAPIDS Engineering at NVIDIA


Article by Stephanie Kirmer and Aaron Richter, Ph.D.

ODSC Community

The Open Data Science community is passionate and diverse, and we always welcome contributions from data science professionals! All of the articles under this profile are from our community, with individual authors mentioned in the text itself.
