Standout Code Snippets From ODSC West 2022

This article brings you up-to-speed on some of the best code snippets you may have missed if you were not at ODSC West 2022.

For this notebook you will need the standard imports:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import skewnorm

Side-By-Side Data Visualizations

This is a common technique you will find in many notebooks across GitHub, Kaggle, and elsewhere. As excellent as this technique is, I wish more people would use it.

Clinton Brownley, who presented a tour of machine learning in Python, offers a clear and concise example of the technique that elegantly solves a common use case — when you want to compare multiple distributions side-by-side.

# Specify fictional data to work with.
r = pd.Series(skewnorm.rvs(a=4, loc=10, scale=4, size=1000))

# Specify a four column subplot.
fig, axes = plt.subplots(figsize=(20, 5), ncols=4)

# Note: sns.distplot is deprecated in seaborn 0.11 and later.
sns.distplot(r, ax=axes[0],
             kde=False, rug=False,
             fit=stats.norm).set_title('Original')
sns.distplot(np.log(r), ax=axes[1],
             kde=False, rug=False,
             fit=stats.norm).set_title('Natural Log')
sns.distplot(np.sqrt(r), ax=axes[2],
             kde=False, rug=False,
             fit=stats.norm).set_title('Square Root')
sns.distplot(1/r, ax=axes[3],
             kde=False, rug=False,
             fit=stats.norm).set_title('Inverse')
Four distribution plots shown side-by-side.

Image Credit: Image generated from code snippets shown above. First published by Clinton Brownley.
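The same side-by-side layout can also be built with a loop, which avoids repeating the plotting call for each panel. The following is a minimal sketch, not the presenter's code: it assumes plain matplotlib histograms, and the gamma-distributed data and the `transforms` dictionary are hypothetical stand-ins chosen for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Fictional, right-skewed, strictly positive data (an assumption for illustration).
rng = np.random.default_rng(seed=0)
r = rng.gamma(shape=2.0, scale=5.0, size=1000) + 1.0

# Map each panel title to the transformation it displays.
transforms = {
    "Original": lambda x: x,
    "Natural Log": np.log,
    "Square Root": np.sqrt,
    "Inverse": lambda x: 1 / x,
}

# One row, with one column per transformation.
fig, axes = plt.subplots(figsize=(20, 5), ncols=len(transforms))

# Pair each axis with a (title, function) entry and draw one histogram per panel.
for ax, (title, func) in zip(axes, transforms.items()):
    ax.hist(func(r), bins=30)
    ax.set_title(title)
```

Adding a fifth comparison then only requires one more entry in the dictionary.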

A variation on this theme lets you view a distribution with several different y-axis scales.

# Load example data.
df = pd.read_stata('http://www.stata-press.com/data/r15/auto2.dta')

# Placeholder color (an assumption); the original value of my_blue is not shown.
my_blue = 'tab:blue'

# Specify a two by two subplot + adjust spacing.
sns.set_context('notebook')
fig, axes = plt.subplots(figsize=(12, 4), ncols=2,
                         nrows=2, squeeze=False)
plt.subplots_adjust(hspace=0.8, wspace=0.3)

sns.histplot(df['price'],
             ax=axes[0,0],
             stat='count',
             kde=True, color=my_blue
             ).set_title('Count')
sns.histplot(df['price'],
             ax=axes[0,1],
             stat='frequency',
             kde=True, color=my_blue
             ).set_title('Frequency')
sns.histplot(df['price'],
             ax=axes[1,0],
             stat='percent',
             kde=True, color=my_blue
             ).set_title('Percent')
sns.histplot(df['price'],
             ax=axes[1,1],
             stat='density',
             kde=True, color=my_blue
             ).set_title('Density')

Note, to replicate these colors use the palette strategies I wrote about here. This example also switches the layout from a 1 by 4 to a 2 by 2 display.

Four distribution plots.

Image Credit: Image generated from code snippets shown above.

Transpose The Describe Method

As this is a personal favorite Pandas hack of mine, I was glad to see multiple presenters at ODSC using it.

Illustration of the Pandas describe and transpose methods.

Image credit: Author’s illustration built in Canva.

If you use df.describe() you know that it produces summary statistics. The problem with df.describe() is that it puts the variable names across the columns of the summary statistics table.

If you have many variables the table becomes unreadable because it is too wide for the screen. To fix that, chain an additional method: df.describe().transpose() for the win!
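To make the difference concrete, here is a small sketch on fictional data. The random values and the column names `var_0` through `var_5` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Fictional data frame with six numeric variables (an assumption).
rng = np.random.default_rng(seed=0)
df = pd.DataFrame(rng.normal(size=(100, 6)),
                  columns=[f"var_{i}" for i in range(6)])

# Default layout: one column per variable, one row per statistic.
wide = df.describe()
print(wide.shape)  # (8, 6)

# Transposed layout: one row per variable, one column per statistic.
tall = df.describe().transpose()
print(tall.shape)  # (6, 8)
```

The transposed table grows downward as you add variables, so it stays readable on screen. The `.T` attribute is a shorthand for `.transpose()`.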

Aggregate Methods On IsNull

In his presentations on Scikit Learn, Corey Wade offers multiple examples of this clever way to chain methods after isnull().

When inspecting a data frame for missing values you are likely familiar with the df.isnull() method. That method returns a data frame of the same shape in which each value is True or False: True when missing, False otherwise.

Several clever additional method chains make this output more readable. For example, perhaps you already knew about the df.isnull().sum() hack:

# Load example data.
df = pd.read_stata('http://www.stata-press.com/data/r15/auto2.dta')

# View which variables have missing values, and how many.
df.isnull().sum()

But did you know about the double sum? Using df.isnull().sum().sum() will give the total number of missing values (or entries) across the entire data frame.

# View how many missing values there are in the entire df.
df.isnull().sum().sum()
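The counts are easy to verify by hand on a tiny example. This sketch uses a hypothetical three-column data frame, not the example data loaded above.

```python
import numpy as np
import pandas as pd

# Small fictional data frame with known missing values (an assumption).
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],    # one missing value
    "b": [np.nan, np.nan, "x", "y"],  # two missing values
    "c": [1, 2, 3, 4],               # no missing values
})

# First sum: missing values per column.
per_column = toy.isnull().sum()
print(per_column.to_dict())  # {'a': 1, 'b': 2, 'c': 0}

# Second sum: total missing values in the whole data frame.
total = toy.isnull().sum().sum()
print(total)  # 3
```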

A less intuitive hack is taking the mean of boolean values. The mean of a set of booleans equals the proportion that are True.
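A quick sketch shows why this works: pandas treats True as 1 and False as 0 when averaging. The series and data frame below are fictional and exist only for illustration.

```python
import pandas as pd

# Three of the four values are True, so the mean is 3 / 4.
flags = pd.Series([True, False, True, True])
print(flags.mean())  # 0.75

# The same idea applied to a missing-value mask (fictional data).
toy = pd.DataFrame({
    "a": [1.0, None, 3.0, 4.0],    # 1 of 4 missing
    "b": [None, None, 7.0, 8.0],   # 2 of 4 missing
})
proportion_missing = toy.isnull().mean()
print(proportion_missing.to_dict())  # {'a': 0.25, 'b': 0.5}
```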

Thus, for more useful output, expressing the “missingness” as a proportion of the observations is helpful. The df.isnull().mean() method chain does this for you:

# View the percentage of values missing in each column.
df.isnull().mean() * 100

# Use lambda to spruce up the output.
df.isnull().mean().apply(lambda x:
                         str(round(x * 100, 1)) +
                         "% Missing Values")

Quality Assurance Environmental Checks

Making sure your environment is ready to go, at the time of development and on through subsequent runs, is important. Stefanie Molin demonstrated a few hacks that do this well. She shared her implementation in her presentation on data visualization.


Image Credit: From Stefanie Molin’s data visualization in Python repository.

As you can see from the image above, taken from Stefanie’s repository, this hack produces elegant and readable output. Placing this code in any repository will be a handy way to add a greater measure of quality assurance for you and those with whom you collaborate.
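For readers who want a starting point, the general idea of an environment check can be sketched with the standard library alone. This is not Stefanie Molin's implementation. The function name `check_environment`, its parameters, and the output format are all hypothetical and shown only to illustrate the approach.

```python
import sys
from importlib import metadata


def check_environment(min_python, required_packages):
    """Return a list of (name, status, detail) tuples (hypothetical helper)."""
    results = []

    # Compare the running interpreter against the minimum version tuple.
    python_ok = sys.version_info >= min_python
    results.append(("Python", "OK" if python_ok else "FAIL",
                    sys.version.split()[0]))

    # Look up the installed version of each required distribution.
    for name in required_packages:
        try:
            results.append((name, "OK", metadata.version(name)))
        except metadata.PackageNotFoundError:
            results.append((name, "FAIL", "not installed"))
    return results


# Example run; the package list here is an assumption.
packages = ["pandas", "numpy", "matplotlib", "seaborn"]
for name, status, detail in check_environment((3, 8), packages):
    print(f"[{status:>4}] {name} {detail}")
```

Running a check like this at the top of a notebook surfaces a missing package before any analysis code fails partway through.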

Conclusion

This article summarized some of the best snippets of code shared during ODSC West 2022 in San Francisco. Specific examples included code that presents data visualizations side-by-side, the use of aggregate methods following the df.isnull() method, and coding conventions that can check and re-check for proper environmental configuration.

If you missed ODSC West in San Francisco you should consider future editions. In just a few short weeks, information and registration for ODSC East in Boston will be available.

Thanks For Reading

Adam Ross Nelson is a data scientist + career coach. Read more about advancing your data science career: coaching.adamrossnelson.com.

Thanks for reading. Send me your thoughts and ideas. You can write just to say hey. And if you really need to tell me how I got it wrong, I look forward to chatting soon. Twitter: @adamrossnelson | LinkedIn: Adam Ross Nelson | Facebook: Adam Ross Nelson.
