

Creating if/elseif/else Variables in Python/Pandas
PythonTools & Languagesposted by Steve Miller March 27, 2020 Steve Miller

Summary: This blog demos Python/Pandas/Numpy code to manage the creation of Pandas dataframe attributes with if/then/else logic. It contrasts five approaches for conditional variables using a combination of Python, Numpy, and Pandas features/techniques. An intermediate level of Python/Pandas programming sophistication is assumed of readers.
The data used to showcase the code revolves on the who, what, where, and when of Chicago crime from 2001 to the present, made available a week in arrears. The data set is, alas, quite large, with over 7M crime records and in excess of 20 attributes. I’ve been working with Chicago crime data in both R/data.table and Python/Pandas for more than five years, and have processes in place to download/enhance the data daily.
My self-directed task for this blog was to load the latest enhanced data using the splendid feather library for interoperating between R and Pandas dataframes, and then to examine different techniques for creating a new “season” attribute determined by the month of year. When coding in Pandas, the programmer has Pandas, native Python, and Numpy techniques at her disposal. Of the five methods outlined, the first two are functional Pandas, the third is Numpy, the fourth is pure Pandas, and the fifth deploys a second Numpy function. In the end, I prefer the fifth option for both flexibility and performance.
The technology used is Wintel 10 along with JupyterLab 1.2.4 and Python 3.7.5, plus foundation libraries Pandas 0.25.3 and Numpy 1.16.4.
Load a personal functions library. blanks, metadf, and freqsdf, a general-purpose frequencies procedure, are used here.
import sys
functdir = "c:/steve/jupyter/notebooks/functions"
sys.path.append(functdir)
from newmyfuncs import *
blanks(2)
Next, import pertinent Python libraries.
import warnings, os, numpy as np, pandas as pd, time, datetime, re, string, random, math, feather
warnings.filterwarnings('ignore')
blanks(2)
Load a previously constituted Chicago crime data file consisting of over 7M records and 20+ attributes. The feather file used was written by an R script run earlier. Compute year, month, day, and hour integers from a date field. Downcast 64 bit floats and ints to 32.
start = time.time()
wdir = "c:/bigdata/raw/chicago"
os.chdir(wdir)
ifile = "crimeplus.feather"
chicagocrime = feather.read_dataframe(ifile).reset_index(drop=True)
chicagocrime.eval("""
year = date.dt.year
month = date.dt.month
day = date.dt.day
hour = date.dt.hour
""",inplace=True)
for c in enumerate(chicagocrime.dtypes) :
if c[1] == "float64" :
print (chicagocrime.columns[c[0]])
chicagocrime[chicagocrime.columns[c[0]]] =\
pd.to_numeric(chicagocrime[chicagocrime.columns[c[0]]], downcast='float')
blanks(1)
for c in enumerate(chicagocrime.dtypes) :
if c[1] == "int64" :
print (chicagocrime.columns[c[0]])
chicagocrime[chicagocrime.columns[c[0]]] =\
pd.to_numeric(chicagocrime[chicagocrime.columns[c[0]]], downcast='float')
print("\n",chicagocrime.date.max(),"\n")
metadf(chicagocrime,data=False)
end = time.time()
print(end - start)
blanks(2)
Compute a series of identical “season” attributes based on month from the chicagocrime dataframe using a variety of methods.
1) First up, Pandas apply/map with a native Python function call. Note that Python has no “case” statement, but does support a general if/then/elseif/else construct.
start = time.time()
def mkseason(n):
if 3<=n<=5: season = 'spring'
elif 6<=n<=8: season = 'summer'
elif 9<=n<=11: season = 'fall'
elif n<=2 or n==12: season = 'winter'
else: season = 'unknown'
return(season)
chicagocrime['season_1'] = chicagocrime['month'].apply(mkseason)
end = time.time()
print("\n",end - start)
blanks(2)
2) Next, Pandas apply/map invoking a Python lambda function. This approach doesn’t implement elseif directly, but rather through nested else’s.
start = time.time()
chicagocrime['season_2'] = chicagocrime.month.map(\
lambda x:'spring' if 3<=x<=5 else \
('summer' if 6<=x<=8 else\
('fall' if 9<=x<=11 else\
('winter' if 1<=x<=2 or x==12 else\
'unknown'))))
end = time.time()
print("\n",end - start)
blanks(2)
3) Now consider the Numpy where function with nested else’s similar to the above.
start = time.time()
chicagocrime['season_3'] =\
np.where((chicagocrime.month>=3) & (chicagocrime.month<=5),'spring',\
np.where((chicagocrime.month>=6) & (chicagocrime.month<=8),'summer',\
np.where((chicagocrime.month>=9) & (chicagocrime.month<=11),'fall',\
np.where((chicagocrime.month==12) | (chicagocrime.month<=2),'winter',\
'unknown'))))
end = time.time()
print("\n",end - start)
blanks(2)
4) Native Pandas. Start with ‘unknown’ and progressively update.
start = time.time()
chicagocrime['season_4'] = 'unknown'
chicagocrime.loc[(chicagocrime.month>=3) & (chicagocrime.month<=5),'season_4'] = 'spring'
chicagocrime.loc[(chicagocrime.month>=6) & (chicagocrime.month<=8),'season_4'] = 'summer'
chicagocrime.loc[(chicagocrime.month>=9) & (chicagocrime.month<=11),'season_4'] = 'fall'
chicagocrime.loc[(chicagocrime.month<=2) | (chicagocrime.month==12),'season_4'] = 'winter'
end = time.time()
print("\n",end - start)
blanks(2)
5) Finally, the Numpy select function. This one implements elseif’s naturally, with a default case to handle “else”.
start = time.time()
condlist = [(chicagocrime.month>=3)&(chicagocrime.month<6),
(chicagocrime.month>=6)&(chicagocrime.month<=8),
(chicagocrime.month>=9)&(chicagocrime.month<=11),
(chicagocrime.month<=2)|(chicagocrime.month==12)
]
choicelist = ['spring', 'summer','fall','winter']
chicagocrime['season_5'] = np.select(condlist, choicelist, default='unknown')
end = time.time()
print("\n",end - start)
blanks(2)
All five season versions equal? Yes.
start = time.time()
print(chicagocrime.season_1.equals(chicagocrime.season_2))
print(chicagocrime.season_1.equals(chicagocrime.season_3))
print(chicagocrime.season_1.equals(chicagocrime.season_4))
print(chicagocrime.season_1.equals(chicagocrime.season_5))
end = time.time()
print("\n",end - start)
blanks(2)
Show the newly-created season vars in action with frequencies of crime type. Speedy.
start = time.time()
vars = ['season_1']
print("homicide\n",freqsdf(chicagocrime[chicagocrime.h_tf][vars],vars))
blanks(1)
vars = ['season_2']
print("violent crime\n",freqsdf(chicagocrime[chicagocrime.v_tf][vars],vars))
blanks(1)
vars = ['season_3']
print("property crime\n",freqsdf(chicagocrime[chicagocrime.p_tf][vars],vars))
blanks(1)
vars = ['season_4']
print("index crime\n",freqsdf(chicagocrime[chicagocrime.i_tf][vars],vars))
blanks(1)
vars = ['season_5']
print("all crime\n",freqsdf(chicagocrime[vars],vars))
end = time.time()
print("\n",end - start)
blanks(2)
How do the five conditional variable creation approaches stack up?
Much as I’d like to recommend 1) or 2) for their functional inclinations, I’m hestitant. While performance is very good when a single attribute, in this case month, is used, it degrades noticeably when multiple attributes are involved in the calculation, as is often the case.
And 3) shares the absence of pure elseif affliction with 2), while 4) seems a bit clunky and awkward.
That leaves 5), the Numpy select, as my choice. It’s simple, handles elseif’s cleanly, and is generally performant, even with multiple attributes — as the silly code below demonstrates.
start = time.time()
condlist = [((chicagocrime.season_5=="summer")&(chicagocrime.year.isin([2012,2013,2014,2015]))),
((chicagocrime.season_5=="summer")&(chicagocrime.year.isin([2016,2017,2018,2019])))
]
choicelist = ['summerold', 'summerrecent']
chicagocrime['slug'] = np.select(condlist,choicelist,'unknown')
end = time.time()
print("\n",end - start)
blanks(2)
Lastly, view several sets of frequencies with this newly-created attribute using the Pandas query method.
start = time.time()
vars = ['slug']
print("\nhomicides\n",freqsdf(chicagocrime.query('h_tf')[vars],vars))
print("\n property crimes\n",freqsdf(chicagocrime.query('p_tf'),vars))
end = time.time()
print("\n",end - start)
blanks(2)
That’s it for now. More on data handling/analysis in Python/Pandas and R/data.table in blogs to come.