Editor’s Note: Itai Yaffe and Daniel Haviv are speakers for ODSC East 2022. Be sure to check out their talk, “A bamboo of Pandas: crossing Pandas’ single-machine barrier with Apache Spark,” there!
Pandas is a fast and powerful open-source data analysis and manipulation framework written in Python. Apache Spark is an open-source unified analytics engine for distributed large-scale data processing. Both are widely adopted in the data engineering and data science communities.
Even though combining them offers great value in terms of productivity, scalability, and performance, this combination is often overlooked. In this blog post, we’ll give a sneak peek into combining these tools to enjoy the best of both worlds!
For the purpose of this example, we’ve used a 1.9GB CSV file containing fire department call data, obtained from https://data.sfgov.org/Public-Safety/Fire-Department-Calls-for-Service/nuek-vuh3 as of 11/11/2019.
First, let’s try to calculate the total number of calls per zip code, using Pandas:
```python
import pandas as pd
import time

# Record the start time
start = time.time()

# Read the CSV file with the header
pandasDF = pd.read_csv('/dbfs/databricks-datasets/timeseries/Fires/Fire_Department_Calls_for_Service.csv', header=0)

# Compute the total number of calls per zip code
pandasDF.groupby('Zipcode of Incident')['Call Number'].count()

# Record the end time
end = time.time()
print('Command took ', end - start, ' seconds')
```
This took roughly 40 seconds on an i3.xlarge machine (with 30.5GB RAM and 4 cores). Keep in mind this is a small dataset, used here for example purposes.
Can we improve it? With Pandas API on Spark – we can!
Apache Spark is a distributed processing engine, which will allow us to easily parallelize the computation.
Pandas API on Spark is a drop-in replacement compatible with the Pandas API, which gives Pandas users the benefits of Spark with minimal code changes.
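To illustrate what “drop-in replacement” means in practice, here is a minimal sketch using plain Pandas and a tiny in-memory CSV (the rows and values below are made up for illustration; only the column names match the real dataset):

```python
import io
import pandas as pd

# Toy CSV standing in for the fire department calls dataset (made-up rows)
csv_data = io.StringIO(
    "Call Number,Zipcode of Incident\n"
    "1,94103\n"
    "2,94103\n"
    "3,94110\n"
)

df = pd.read_csv(csv_data, header=0)

# This exact groupby/count works unchanged if `pd` were replaced by
# `pyspark.pandas` (imported as `ps`) and the path were a dbfs:/ URI.
counts = df.groupby('Zipcode of Incident')['Call Number'].count()
print(counts.to_dict())  # {94103: 2, 94110: 1}
```

Because the two libraries expose the same API surface, the analysis code itself does not need to know which engine it is running on.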
It is also useful for PySpark users, as it supports tasks that are easier to accomplish using Pandas, such as plotting an Apache Spark DataFrame.
Let’s try the same example, but this time – using Pandas API on Spark:
```python
import pyspark.pandas as ps
import time

# Record the start time
start = time.time()

# Read the CSV file with the header
pysparkDF = ps.read_csv('dbfs:/databricks-datasets/timeseries/Fires/Fire_Department_Calls_for_Service.csv', header=0)

# Compute the total number of calls per zip code
pysparkDF.groupby('Zipcode of Incident')['Call Number'].count()

# Record the end time
end = time.time()
print('Command took ', end - start, ' seconds')
```
Notice we only had to change import pandas as pd to import pyspark.pandas as ps (and point the file path at the dbfs:/ URI that Spark expects).
This time, it took only about 7 seconds, which can be attributed to the fact that the computation is executed in a distributed manner (as opposed to Pandas). In this example, we used the same i3.xlarge machine (with 30.5GB RAM and 4 cores) as the cluster driver, and 4 i3.xlarge machines as the cluster workers.
Essentially, Spark divided the 1.9GB file into smaller chunks (which are called “partitions”), and all partitions were processed concurrently across all machines in the cluster.
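The partition-then-merge pattern Spark applies here can be sketched in plain Python. The following is a simplified, single-machine illustration (the records, the `split` helper, and the use of threads are all made up for this sketch; Spark actually distributes partitions across processes and machines):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Toy (call_number, zipcode) pairs standing in for CSV rows
records = [(i, 94103 if i % 3 else 94110) for i in range(12)]

def count_partition(partition):
    # Count calls per zipcode within one partition (the "map" side)
    return Counter(zipcode for _, zipcode in partition)

def split(data, n):
    # Split data into n roughly equal partitions
    k = len(data) // n
    return [data[i * k:(i + 1) * k] for i in range(n - 1)] + [data[(n - 1) * k:]]

partitions = split(records, 4)

# Process all partitions concurrently, then merge the partial counts
# (the "reduce" side) into a single result
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = pool.map(count_partition, partitions)

total = Counter()
for partial in partials:
    total.update(partial)
print(dict(total))  # {94103: 8, 94110: 4}
```

Spark performs the same map/merge steps, but with each partition read and processed on a different core or worker machine.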
That means Spark was able to run 16 tasks concurrently (4 workers with 4 cores each).
It’s important to note that the larger the dataset, the greater the performance improvement you can expect (e.g., think about what would have happened if we chose a 1900GB dataset rather than a 1.9GB dataset…).
Using the Pandas API on Spark is just one of the options available to Python developers to easily use Spark and enjoy the performance, scalability, and stability benefits it provides.
To learn more about Pandas API on Apache Spark and the other options available for you to integrate your Python code with Spark, join our session at the upcoming Open Data Science Conference East 2022 on April 20th, A Bamboo of Pandas: Crossing Pandas’ Single-Machine Barrier with Apache Spark, which will include demos and code examples.
About the authors/ODSC East 2022 Speakers on Apache Spark:
Prior to Databricks, Itai was a Principal Solutions Architect at Imply, and before that – a big data tech lead at Nielsen Identity, where he dealt with big data challenges using tools like Spark, Druid, Kafka, and others.
He is also a part of the Israeli chapter’s core team of Women in Big Data. Itai is keen on sharing his knowledge and has presented his real-life experience in various forums in the past.
Daniel Haviv has been working with a multitude of companies throughout his career, helping them solve their data challenges, most recently as a Senior Solutions Architect at Databricks and, before that, as an Analytics Specialist SA at AWS.