pandas works great on small datasets for quick prototyping, but often breaks when we have to work with large datasets.
Why Pandas ≠ Scale 🤦
This stems back to some of the fundamental design decisions around pandas. Pandas is an in-memory data structure, which means that you need to be able to fit the data that you’re working with into memory. As a result, you can easily run into out-of-memory errors on even moderately-sized datasets.
Introducing Modin: Scaling pandas by changing just a single line of code
To address these issues, we developed Modin — a scalable, drop-in replacement for pandas. Modin empowers practitioners to use pandas on data at scale, without requiring them to change a single line of code. Modin leverages our cutting-edge academic research on dataframes—the abstraction underlying pandas to bring the best of databases and distributed systems to dataframes. To use Modin, all you need to do is to replace your import statement as follows:
#import pandas as pd import modin.pandas as pd df = pd.read_csv("bigdata.csv")
Once you’ve changed your import statement, you’re ready to use Modin just like you would with pandas! You can easily get 4X speedup on your laptop without ever changing your pandas code!
Run pandas at scale on your data warehouse 🚀
Most enterprise data teams store their data in a database or data warehouse, such as Snowflake, BigQuery, or DuckDB. That’s a problem when you’re trying to work with that data in pandas because you have to pull the dataset into the memory of your machine, which can be slow, expensive, and lead to fatal out-of-memory issues.
Ponder solves this problem by translating your pandas code to SQL that can be understood by your data warehouse. The effect is that you get to use your favorite pandas API, but your data pipelines run on one of the most battle-tested and heavily-optimized data infrastructures today — databases.
Here’s how easy it is to get started:
import modin.pandas as pd import ponder ponder.init()
Then we set up a database connection. Ponder currently supports Snowflake, BigQuery and DuckDB:
# Using Snowflake's Python connector import snowflake.connector db_con = snowflake.connector.connect(user=**, password=**, account=**, role=**, database=**, schema=**, warehouse=**)
With Ponder, read_sql simply establishes a connection to your table in your data warehouse. Data stays in your warehouse and any subsequent pandas command you run gets executed as SQL queries in your warehouse.
df = pd.read_sql("DB_TABLE",db_con) # NOTE: Data doesn't get read into memory! # Run operations in pandas - all in Snowflake! df.describe() # Compute summary statistics
By running everything directly in the database, you inherit the scalability of your data warehouse. With Ponder, you can run pandas on more than a terabyte of data. We’ve shown that this can lead to massive workflow speedups by saving more than 2 hours of developer time when working with 150M row data on Snowflake and BigQuery.
About the author:
Doris Lee is the CEO and co-founder of Ponder. Doris received her Ph.D. from the UC Berkeley RISE Lab and School of Information in 2021, where she developed tools that help data scientists explore and understand their data. She is the recipient of Forbes 30 under 30 for Enterprise Technology in 2023.