Pandas is a popular data analysis library built on top of the Python programming language. It assists with common manipulations for data cleaning, joining, sorting, filtering, deduping, and more. First released in 2009, pandas now sits as the epicenter of Python’s vast data science ecosystem and is an essential tool in the modern data analyst’s toolbox.
Pandas represents a fantastic step forward for graphical spreadsheet users who’d like to handle larger amounts of data, perform more complex operations, and automate the steps of their analysis routines. I like to introduce the tool as “Excel on steroids.”
Here’s some good news: you don’t need to be a software engineer to work effectively with the library. In fact, pandas offers Excel users a great bridge to get started with Python and programming in general. If you’ve never written a line of code before, you’ll be pleasantly surprised by how many spreadsheet operations already require you to think like a developer.
Let’s explore a sample dataset to see some of the library’s powerful features. If you’d like a deeper dive into the syntax and mechanics of pandas, tune in to my upcoming ODSC workshop this October, “Getting Started with Pandas for Data Analysis.”
We’ll start by importing pandas and assigning it an alias.
import pandas as pd
Our dataset is a CSV file of titles available on the online streaming service Netflix. Each row includes the title’s name, type, release year, duration, and the categories it’s listed in.
netflix = pd.read_csv("netflix_titles.csv") netflix.head()
Let’s say we’re in the mood for a 90s comedy film. We can find the subset of rows that fit our criteria by applying filtering conditions to the type, release_year and listed_in columns. First up, let’s extract the rows with a value of “Movie” in the type column.
movies = netflix["type"] == "Movie" netflix[movies].head()
Next up, let’s find our comedies. We’ll need to be a bit clever here. There are 4 categories in the listed_in column that we should include: “Comedies”, “Stand-Up Comedy” “TV Comedies”, and “Stand-Up Comedy & Talk Shows“. These categories can also be nested amongst other non-related categories. We can use regular expressions to identify the titles whose listed_in text includes the substring “Comed” followed by any characters.
comedies = netflix["listed_in"].str.contains(r'Comed.*') netflix[comedies].head()
made_in_nineties = netflix["release_year"].between(1990, 1999) netflix[made_in_nineties].head()
We’ve now declared three individual conditions to filter the dataset. The final step is to apply all three conditions together. In the next example, we ask pandas for all titles that are movies and comedies and released between 1990 and 1999.
netflix[movies & comedies & made_in_nineties].head()
netflix[movies & comedies & made_in_nineties].sort_values("release_year").head()
For a more in-depth overview of Pandas, check out my upcoming ODSC talk this October, “Getting Started with Pandas for Data Analysis.” We’ll explore several real-world datasets and walk through many of the features of this powerful data analysis tool.
About the author/ODSC West 2020 speaker: Boris Paskhaver is a full-stack web developer based in New York City with experience building apps in React / Redux and Ruby on Rails. His favorite part of programming is the never-ending sense that there’s always something new to master — a secret language feature, a popular design pattern, an emerging library or — most importantly — a different way of looking at a problem.