This article brings you up to speed on some of the best code snippets you may have missed if you were not at ODSC West 2022.
For this notebook you will need the standard imports:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import skewnorm
Side-By-Side Data Visualizations
This is a common technique you will find in many notebooks across GitHub, Kaggle, and elsewhere. As excellent as this technique is, I wish more people would use it.
Clinton Brownley, who presented a tour of machine learning in Python, offers a clear and concise example of the technique that elegantly solves a common use case — when you want to compare multiple distributions side-by-side.
# Specify fictional data to work with.
r = pd.Series(skewnorm.rvs(a=4, loc=10, scale=4, size=1000))

# Specify a four column subplot.
fig, axes = plt.subplots(figsize=(20, 5), ncols=4)

sns.distplot(r, ax=axes[0], kde=False, rug=False,
             fit=stats.norm).set_title('Original')
sns.distplot(np.log(r), ax=axes[1], kde=False, rug=False,
             fit=stats.norm).set_title('Natural Log')
sns.distplot(np.sqrt(r), ax=axes[2], kde=False, rug=False,
             fit=stats.norm).set_title('Square Root')
sns.distplot(1/r, ax=axes[3], kde=False, rug=False,
             fit=stats.norm).set_title('Inverse')
Image Credit: Image generated from code snippets shown above. First published by Clinton Brownley.
A variation on this theme lets you view multiple distributions with multiple different vertical y-axes.
# Load example data.
df = pd.read_stata('http://www.stata-press.com/data/r15/auto2.dta')

# Specify a two by two subplot + adjust spacing.
sns.set_context('notebook')
fig, axes = plt.subplots(figsize=(12, 4), ncols=2, nrows=2, squeeze=False)
plt.subplots_adjust(hspace=0.8, wspace=0.3)

# Note: my_blue is a custom color; see the palette note that follows.
sns.histplot(df['price'], ax=axes[0, 0], stat='count', kde=True,
             color=my_blue).set_title('Count')
sns.histplot(df['price'], ax=axes[0, 1], stat='frequency', kde=True,
             color=my_blue).set_title('Frequency')
sns.histplot(df['price'], ax=axes[1, 0], stat='percent', kde=True,
             color=my_blue).set_title('Percent')
sns.histplot(df['price'], ax=axes[1, 1], stat='density', kde=True,
             color=my_blue).set_title('Density')
Note, to replicate these colors use the palette strategies I wrote about here. This example also switches the layout from a 1 by 4 to a 2 by 2 display.
Image Credit: Image generated from code snippets shown above.
Transpose The Describe Method
As this is a personal favorite Pandas hack of mine, I was glad to see multiple presenters at ODSC using it.
Image credit: Author’s illustration built in Canva.
If you use df.describe() you know that it produces summary statistics. The problem with df.describe() is that it puts the variable names across the columns of the summary statistics table. If you have many variables, the table is unreadable: it will be too wide for the screen. To fix that, chain an additional method: df.describe().transpose() for the win!
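A minimal sketch of the difference, using a small hypothetical DataFrame (the column names here are placeholders):

```python
import numpy as np
import pandas as pd

# Small example frame with four numeric columns.
df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  columns=['a', 'b', 'c', 'd'])

wide = df.describe()              # variables across the columns
tall = df.describe().transpose()  # one row per variable

print(wide.shape)  # (8, 4): stats as rows, variables as columns
print(tall.shape)  # (4, 8): variables as rows, stats as columns
```

With dozens of variables, the transposed version scrolls vertically instead of spilling off the right edge of the screen.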
Aggregate Methods On IsNull
In his presentations on scikit-learn, Corey Wade offered multiple examples of this clever way to chain aggregation methods after df.isnull().
When inspecting a data frame for missing values, you are likely familiar with the df.isnull() method. That method returns a copy of the data frame where each value is True or False: True when missing, False otherwise.
Several clever additional method chains make this output more readable. For example, perhaps you already knew about the .sum() chain:
# Load example data.
df = pd.read_stata('http://www.stata-press.com/data/r15/auto2.dta')

# View which variables have missing values, and how many.
df.isnull().sum()
But did you know about the double sum? Using df.isnull().sum().sum() will give the total number of missing values across the entire data frame.
# View how many missing values there are in the entire df.
df.isnull().sum().sum()
A less intuitive hack is taking the mean of Boolean values: the mean of a Boolean column is equivalent to the proportion True. Thus, for more useful output, expressing the "missingness" as a proportion of the observations is helpful. The df.isnull().mean() method chain does this for you:
# View the percentage of values missing in each column.
df.isnull().mean() * 100

# Use lambda to spruce up the output.
df.isnull().mean().apply(
    lambda x: str(round(x * 100, 1)) + "% Missing Values")
Quality Assurance Environment Checks
Making sure your environment is ready to go at the time of development, and on through subsequent runs, is important. Stefanie Molin demonstrated a few hacks that do this well, sharing her implementation in her presentation on data visualization.
Image Credit: From Stefanie Molin’s data visualization in the Python repository.
As you can see from the image above, taken from Stefanie’s repository, this hack produces elegant and readable output. Placing this code in any repository will be a handy way to add a greater measure of quality assurance for you and those with whom you collaborate.
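The sketch below illustrates the general idea of an environment check; it is not Stefanie's implementation, and the minimum version pins are hypothetical placeholders you would replace with your notebook's actual requirements:

```python
import sys
from importlib.metadata import version, PackageNotFoundError

# Minimum versions this notebook assumes (hypothetical pins).
requirements = {'pandas': '1.4', 'numpy': '1.21', 'matplotlib': '3.5'}

def as_tuple(v):
    """Convert a version string such as '1.4.2' to a comparable tuple."""
    return tuple(int(p) for p in v.split('.') if p.isdigit())

# Report the running Python version first.
print(f'Python {sys.version_info.major}.'
      f'{sys.version_info.minor}.{sys.version_info.micro}')

# Then compare each installed package against its minimum pin.
for pkg, minimum in requirements.items():
    try:
        installed = version(pkg)
    except PackageNotFoundError:
        print(f'{pkg:>12}  NOT INSTALLED')
        continue
    status = 'OK' if as_tuple(installed) >= as_tuple(minimum) else 'UPGRADE'
    print(f'{pkg:>12}  {installed:>10}  {status}')
```

Running a cell like this at the top of a notebook surfaces version drift immediately, rather than through a cryptic failure halfway down the page.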
This article summarized some of the best snippets of code shared during ODSC West 2022 in San Francisco. Specific examples included code that presents data visualizations side-by-side, the use of aggregate methods following the df.isnull() method, and coding conventions that can check and re-check for proper environment configuration.
If you missed ODSC West in San Francisco you should consider future editions. In just a few short weeks, information and registration for ODSC East in Boston will be available.
Thanks For Reading
Adam Ross Nelson is a data scientist + career coach. Read more about advancing your data science career: coaching.adamrossnelson.com.
Thanks for reading. Send me your thoughts and ideas. You can write just to say hey. And if you really need to tell me how I got it wrong, I look forward to chatting soon. Twitter: @adamrossnelson | LinkedIn: Adam Ross Nelson | Facebook: Adam Ross Nelson.