fbpx
Getting Started with Pandas Getting Started with Pandas
Pandas is a popular data analysis library built on top of the Python programming language. It assists with common manipulations for data cleaning, joining,... Getting Started with Pandas

Pandas is a popular data analysis library built on top of the Python programming language. It assists with common manipulations for data cleaning, joining, sorting, filtering, deduping, and more. First released in 2009, pandas now sits as the epicenter of Python’s vast data science ecosystem and is an essential tool in the modern data analyst’s toolbox.

Pandas represents a fantastic step forward for graphical spreadsheet users who’d like to handle larger amounts of data, perform more complex operations, and automate the steps of their analysis routines. I like to introduce the tool as “Excel on steroids.”

Here’s some good news: you don’t need to be a software engineer to work effectively with the library. In fact, pandas offers Excel users a great bridge to get started with Python and programming in general. If you’ve never written a line of code before, you’ll be pleasantly surprised by how many spreadsheet operations already require you to think like a developer.

Let’s explore a sample dataset to see some of the library’s powerful features. If you’d like a deeper dive into the syntax and mechanics of pandas, tune in to my upcoming ODSC workshop this October, “Getting Started with Pandas for Data Analysis.”

We’ll start by importing pandas and assigning it an alias.

import pandas as pd

Our dataset is a CSV file of titles available on the online streaming service Netflix. Each row includes the title’s name, type, release year, duration, and the categories it’s listed in.

netflix = pd.read_csv("netflix_titles.csv")
netflix.head()

Let’s say we’re in the mood for a 90s comedy film. We can find the subset of rows that fit our criteria by applying filtering conditions to the type, release_year and listed_in columns. First up, let’s extract the rows with a value of “Movie” in the type column.

movies = netflix["type"] == "Movie"
netflix[movies].head()

Next up, let’s find our comedies. We’ll need to be a bit clever here. There are 4 categories in the listed_in column that we should include: “Comedies”, “Stand-Up Comedy” “TV Comedies”, and “Stand-Up Comedy & Talk Shows“. These categories can also be nested amongst other non-related categories. We can use regular expressions to identify the titles whose listed_in text includes the substring “Comed” followed by any characters.

comedies = netflix["listed_in"].str.contains(r'Comed.*')
netflix[comedies].head()

To target titles released in the 90s, we can filter the values in the release_year column to those that fall between a range, from 1990 to 1999.

made_in_nineties = netflix["release_year"].between(1990, 1999)
netflix[made_in_nineties].head()

We’ve now declared three individual conditions to filter the dataset. The final step is to apply all three conditions together. In the next example, we ask pandas for all titles that are movies and comedies and released between 1990 and 1999.

netflix[movies & comedies & made_in_nineties].head()

For the cherry on top, let’s sort the movies by year of release. We’ll sort the numeric values in the release_year column in ascending order.

netflix[movies & comedies & made_in_nineties].sort_values("release_year").head()

All that’s left to do now is select a title and start binging. With only a couple lines of code, we’ve narrowed down a 6,000+ row dataset to 60 titles that fit out interests.

For a more in-depth overview of Pandas, check out my upcoming ODSC talk this October, “Getting Started with Pandas for Data Analysis.” We’ll explore several real-world datasets and walk through many of the features of this powerful data analysis tool.


About the author/ODSC West 2020 speaker: Boris Paskhaver is a full-stack web developer based in New York City with experience building apps in React / Redux and Ruby on Rails. His favorite part of programming is the never-ending sense that there’s always something new to master — a secret language feature, a popular design pattern, an emerging library or — most importantly — a different way of looking at a problem.

ODSC Community

ODSC Community

The Open Data Science community is passionate and diverse, and we always welcome contributions from data science professionals! All of the articles under this profile are from our community, with individual authors mentioned in the text itself.

1