All the Best Parts of Pandas for Data Science

Pandas has been hailed by many in the data science community as the missing link between Python and analysis: a tool that can dramatically reduce overhead in data science projects, make code more understandable, and speed up workflows.

Pandas comes loaded with a wide range of built-in tools that make it easier for data scientists to dig into their datasets quickly and to develop pipelines capable of handling huge volumes of data. We’ll go over some of the features that make Pandas such a desirable solution to the major problems data scientists face.

Too Much Data? No Such Thing.

I was recently working on a project where I was attempting to load a few gigabytes of Census data and run some straightforward analytics on it. At 12 GB, I should have known the naive approach would wind up inadequate, but I gave it a shot. Unsurprisingly, my VM frothed at the mouth for a few hours before kicking the bucket and throwing a memory error.

You can’t load a massive dataset into Pandas in one shot: read_csv pulls the entire file into memory, so anything bigger than your available RAM will fail. However, DataFrames have a nice workaround for dealing with huge datasets: chunking.

Chunking is a short, simple solution for streaming data from a file into a DataFrame. Instead of pulling the entire file into memory and attempting to dump it into a data structure all at once, chunking reads a predesignated number of rows at a time, making the task manageable even on systems with limited memory.

This is the basic pattern for chunking:

import pandas as pd

#Read the file 5,000 rows at a time; change chunksize according to your needs
chunks = []
for chunk in pd.read_csv('myfile.csv', chunksize=5000):
    chunks.append(chunk)
#Combine the chunks once at the end; concatenating inside the loop re-copies the data on every pass
mydata = pd.concat(chunks, ignore_index=True)


Voila; whatever unreasonably large CSV file you were interested in reading will be pulled in piece by piece.
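
If even the concatenated result would be too big for memory, you can go one step further and process each chunk as it streams in, keeping only the aggregates. Here’s a minimal sketch that tallies counts per category; the file name and the 'category_col' column are hypothetical stand-ins for your own data.

import pandas as pd

#Tally category counts chunk by chunk, without ever holding the full dataset in memory
#'category_col' is a placeholder for a real column name
counts = {}
for chunk in pd.read_csv('myfile.csv', chunksize=5000):
    for value, n in chunk['category_col'].value_counts().items():
        counts[value] = counts.get(value, 0) + n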

Slicing and Dicing the DataFrame

One of the other great benefits of Pandas is the way it handles data slicing and indexing, with Pythonic interfaces for selecting the data groupings you’re interested in.

Consider how you select a column, for instance. In most other tools I’ve worked with, I’ve had to create a dictionary mapping column headers to column indices, then use the dict to translate a header into its position. It usually looked something like this.

col_mapper = {
    'My First Column': 0,
    'My Second Column': 1
}
#Retrieve the second column
data[col_mapper['My Second Column']]


No more of that. Pandas uses the CSV header to select columns.

data['My Second Column']
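
Selecting several columns at once is just as direct: pass a list of names and you get a smaller DataFrame back.

#Pass a list of column names to get a DataFrame with just those columns
data[['My First Column', 'My Second Column']]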


Not only that, but Pandas allows you to select rows conditionally with a fairly clean syntax. Say you wanted to pull all rows where the second column has the value 0.

data.loc[data['My Second Column'] == 0]


If we wanted to do something more sophisticated, like every row where the second column is 0 and the first column is either ‘A’ or ‘B’, we can do that easily too.

data.loc[
  (data['My Second Column'] == 0) &
  ((data['My First Column'] == 'A') | (data['My First Column'] == 'B'))
]
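
When a condition is really a membership test like this one, the isin method reads even more cleanly than chained comparisons:

#Equivalent filter using isin for the membership test
data.loc[
  (data['My Second Column'] == 0) &
  data['My First Column'].isin(['A', 'B'])
]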


Data cleaning is also a cinch. Say you have some ugly table where the data starts five rows down, and the header is some mess of top-level headers and sub-headers (for example, ‘United States’ columns with sub-columns each for men and women in the United States). How would we deal with that?

#Skip the messy header rows, read everything after as data, and name the columns ourselves
data = pd.read_csv('myfile.csv', skiprows=5, header=None)
data.columns = ['My First Column', 'My Second Column', 'And So Forth']
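
If you’d rather keep that two-level header than throw it away, read_csv can also parse it into a MultiIndex. A minimal sketch, assuming the two header rows sit at the top of the file and the sub-columns carry hypothetical labels like 'Men' and 'Women':

#Parse the first two rows as a two-level (MultiIndex) column header
data = pd.read_csv('myfile.csv', header=[0, 1])
#Select the men's sub-column under 'United States' (hypothetical labels)
data[('United States', 'Men')]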

Analytics in Pandas

Data isn’t any good if we can’t actually analyze it. Thankfully, Pandas excels at performing simple analytics efficiently and in as little code as possible.

Say we wanted to calculate the midpoint between two columns, row-wise (e.g. row 1 has values 5 and 7, so its midpoint would be 6). We can create a new column with these values like this.

import numpy as np

#Element-wise midpoint of the two columns; dividing by a plain 2 works just as well
data['Midpoints'] = (data['FirstCol'] + data['SecondCol']) / np.full(len(data), 2)


Seriously, it’s that easy. You’ll find it worthwhile to pull in NumPy once in a while for supplemental calculations, but generally speaking, you can consider Pandas your more brain-friendly alternative.

What about grouping by the values in one column and summing the values of another?

data.groupby('ColA')['ColB'].sum()
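
To make that concrete, here’s a toy run with made-up values; the frame and its contents are purely illustrative.

import pandas as pd

#Toy data to illustrate the groupby-sum pattern
data = pd.DataFrame({'ColA': ['x', 'y', 'x'], 'ColB': [1, 2, 3]})
print(data.groupby('ColA')['ColB'].sum())
#ColA
#x    4
#y    2
#Name: ColB, dtype: int64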


Pandas is far and away the simplest, best-designed tool I’ve used for manipulating tabular data. It’s still young (currently on version 0.23), which means there are bugs left to work out. But with a little cleverness, Pandas can do anything NumPy can in a fraction of the developer time.


Ready to learn more data science skills and techniques in-person? Register for ODSC West this October 31 – November 3 now and hear from world-renowned names in data science and artificial intelligence!

Spencer Norris, ODSC

Spencer Norris is a data scientist and freelance journalist. He currently works as a contractor and publishes on his blog on Medium: https://medium.com/@spencernorris
