Standout Code Snippets From ODSC West 2022

Conferences | Modeling | West 2022 · posted by ODSC Community, November 29, 2022

This article brings you up-to-speed on some of the best code snippets you may have missed if you were not at ODSC West 2022.

For this notebook you will need the standard imports, plus the SciPy statistics tools used in the snippets below:

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import skewnorm
```

Side-By-Side Data Visualizations

This is a common technique you will find in many notebooks across GitHub, Kaggle, and elsewhere. As excellent as it is, I wish more people would use it.

Clinton Brownley, who presented a tour of machine learning in Python, offers a clear and concise example of the technique that elegantly solves a common use case — when you want to compare multiple distributions side-by-side.

```
# Specify fictional data to work with.
r = pd.Series(skewnorm.rvs(a=4, loc=10, scale=4, size=1000))

# Specify a four column subplot.
fig, axes = plt.subplots(figsize=(20, 5), ncols=4)

# Note: sns.distplot is deprecated in seaborn >= 0.11;
# sns.histplot is its successor.
sns.distplot(r, ax=axes[0], kde=False, rug=False,
             fit=stats.norm).set_title('Original')
sns.distplot(np.log(r), ax=axes[1], kde=False, rug=False,
             fit=stats.norm).set_title('Natural Log')
sns.distplot(np.sqrt(r), ax=axes[2], kde=False, rug=False,
             fit=stats.norm).set_title('Square Root')
sns.distplot(1/r, ax=axes[3], kde=False, rug=False,
             fit=stats.norm).set_title('Inverse')
```

Image Credit: Image generated from code snippets shown above. First published by Clinton Brownley.

A variation on this theme lets you view multiple distributions with multiple different vertical y-axes.

```
# Load example data.
df = pd.read_stata('http://www.stata-press.com/data/r15/auto2.dta')

# Placeholder color; the original value comes from the palette
# strategies referenced below.
my_blue = '#1f77b4'

# Specify a two by two subplot + adjust spacing.
sns.set_context('notebook')
fig, axes = plt.subplots(figsize=(12, 4), ncols=2,
                         nrows=2, squeeze=False)
fig.tight_layout()

sns.histplot(df['price'], ax=axes[0, 0], stat='count',
             kde=True, color=my_blue).set_title('Count')
sns.histplot(df['price'], ax=axes[0, 1], stat='frequency',
             kde=True, color=my_blue).set_title('Frequency')
sns.histplot(df['price'], ax=axes[1, 0], stat='percent',
             kde=True, color=my_blue).set_title('Percent')
sns.histplot(df['price'], ax=axes[1, 1], stat='density',
             kde=True, color=my_blue).set_title('Density')
```

Note: to replicate these colors, use the palette strategies I wrote about here. This example also switches the layout from a 1-by-4 to a 2-by-2 display.

Image Credit: Image generated from code snippets shown above.

Transpose The Describe Method

As this is a personal favorite pandas hack of mine, I was glad to see multiple presenters at ODSC using it.

Image credit: Author’s illustration built in Canva.

If you use `df.describe()` you know that it produces summary statistics. The problem is that it puts the variable names across the columns of the summary statistics table.

If you have many variables, the table is unreadable: it is too wide for the screen. To fix that, chain one additional method: `df.describe().transpose()` for the win!
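A minimal sketch of the difference, using a made-up data frame with a dozen columns (the data and column names are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100 rows across a dozen columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 12)),
                  columns=[f'var_{i}' for i in range(12)])

# Default orientation: 8 statistics as rows, 12 variables as columns.
print(df.describe().shape)              # (8, 12) -- too wide to read

# Transposed: one row per variable, one column per statistic.
print(df.describe().transpose().shape)  # (12, 8) -- easy to scan
```

With the transposed orientation, adding more variables makes the table taller rather than wider, so it always fits on screen.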

Aggregate Methods On IsNull

In his presentations on scikit-learn, Corey Wade offered multiple examples of this clever way to chain methods after `isnull()`.

When inspecting a data frame for missing values, you are likely familiar with the `df.isnull()` method. It returns a copy of the data frame in which each value is True or False: True when the value is missing, False otherwise.

Several clever additional method chains make this output more readable. For example, perhaps you already knew about the `df.isnull().sum()` hack:

```
# Load example data.
df = pd.read_stata('http://www.stata-press.com/data/r15/auto2.dta')

# View which variables have missing values, and how many.
df.isnull().sum()
```

But did you know about the double sum? `df.isnull().sum().sum()` gives the total number of missing values (or entries) across the entire data frame.

```# View how many missing values there are in the entire df.
df.isnull().sum().sum()```

A less intuitive hack is taking the mean of Boolean values. The mean of a Boolean column is equivalent to the proportion that is True.
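A minimal illustration of that equivalence, with a made-up Boolean series:

```python
import pandas as pd

# Two of these four values are True, so the mean is 0.5.
s = pd.Series([True, False, False, True])
print(s.mean())  # 0.5
```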

Thus, for more useful output, it helps to express the “missingness” as a proportion of the observations. The `df.isnull().mean()` method chain does this for you:

```
# View the percent of values missing in each column.
df.isnull().mean() * 100

# Use a lambda to spruce up the output.
df.isnull().mean().apply(lambda x:
    str(round(x * 100, 1)) + "% Missing Values")
```

Quality Assurance Environmental Checks

Making sure your environment is ready to go, both at development time and through subsequent runs, is important. Stefanie Molin demonstrated a few hacks that do this well, and shared her implementation in her presentation on data visualization.

Image Credit: From Stefanie Molin’s data visualization in Python repository.

As you can see from the image above, taken from Stefanie’s repository, this hack produces elegant and readable output. Placing code like this in any repository is a handy way to add a greater measure of quality assurance for you and those with whom you collaborate.

Conclusion

This article summarized some of the best snippets of code shared during ODSC’s West 2022 session in San Francisco. Specific examples included code that presents data visualizations side-by-side, the use of aggregate methods chained after `df.isnull()`, and coding conventions that check and re-check for proper environment configuration.

If you missed ODSC West in San Francisco you should consider future editions. In just a few short weeks, information and registration for ODSC East in Boston will be available.