fbpx
Creating if/elseif/else Variables in Python/Pandas Creating if/elseif/else Variables in Python/Pandas
Creating if/elseif/else Variables in Python/Pandas

Summary: This blog demos Python/Pandas/Numpy code to manage the creation of Pandas dataframe attributes with if/then/else logic. It contrasts five approaches for conditional variables using a combination of Python, Numpy, and Pandas features/techniques. An intermediate level of Python/Pandas programming sophistication is assumed of readers.

The data used to showcase the code revolves on the who, what, where, and when of Chicago crime from 2001 to the present, made available a week in arrears. The data set is, alas, quite large, with over 7M crime records and in excess of 20 attributes. I’ve been working with Chicago crime data in both R/data.table and Python/Pandas for more than five years, and have processes in place to download/enhance the data daily.

My self-directed task for this blog was to load the latest enhanced data using the splendid feather library for interoperating between R and Pandas dataframes, and then to examine different techniques for creating a new “season” attribute determined by the month of year. When coding in Pandas, the programmer has Pandas, native Python, and Numpy techniques at her disposal. Of the five methods outlined, the first two are functional Pandas, the third is Numpy, the fourth is pure Pandas, and the fifth deploys a second Numpy function. In the end, I prefer the fifth option for both flexibility and performance.

The technology used is Wintel 10 along with JupyterLab 1.2.4 and Python 3.7.5, plus foundation libraries Pandas 0.25.3 and Numpy 1.16.4.

Load a personal functions library. blanks, metadf, and freqsdf, a general-purpose frequencies procedure, are used here.

In [1]:
import sys

functdir = "c:/steve/jupyter/notebooks/functions"
sys.path.append(functdir)

from newmyfuncs import *

blanks(2)

 

Next, import pertinent Python libraries.

 

In [2]:
import warnings, os, numpy as np, pandas as pd, time, datetime, re, string, random, math, feather

warnings.filterwarnings('ignore')

blanks(2)

 

Load a previously constituted Chicago crime data file consisting of over 7M records and 20+ attributes. The feather file used was written by an R script run earlier. Compute year, month, day, and hour integers from a date field. Downcast 64 bit floats and ints to 32.

 

In [3]:
start = time.time()

wdir = "c:/bigdata/raw/chicago"
os.chdir(wdir)

ifile = "crimeplus.feather"
chicagocrime = feather.read_dataframe(ifile).reset_index(drop=True)

chicagocrime.eval("""
year = date.dt.year
month = date.dt.month
day = date.dt.day
hour = date.dt.hour
""",inplace=True)

for c in enumerate(chicagocrime.dtypes) : 
    if c[1] == "float64" : 
        print (chicagocrime.columns[c[0]])
        chicagocrime[chicagocrime.columns[c[0]]] =\
            pd.to_numeric(chicagocrime[chicagocrime.columns[c[0]]], downcast='float')

blanks(1)

for c in enumerate(chicagocrime.dtypes) : 
    if c[1] == "int64" : 
        print (chicagocrime.columns[c[0]])
        chicagocrime[chicagocrime.columns[c[0]]] =\
            pd.to_numeric(chicagocrime[chicagocrime.columns[c[0]]], downcast='float')

print("\n",chicagocrime.date.max(),"\n")
metadf(chicagocrime,data=False)

end = time.time()
print(end - start)

blanks(2)
district
ward
communityarea
xcoordinate
ycoordinate
latitude
longitude
population

year
month
day
hour

 2020-01-20 23:55:00+00:00 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7057215 entries, 0 to 7057214
Data columns (total 31 columns):
id                     int32
casenumber             object
date                   datetime64[ns, UTC]
block                  object
iucr                   object
primarytype            object
description            object
locationdescription    object
arrest                 bool
domestic               bool
beat                   int32
district               float32
ward                   float32
communityarea          float32
fbicode                object
xcoordinate            float32
ycoordinate            float32
year                   float32
updatedon              datetime64[ns, UTC]
latitude               float32
longitude              float32
fbicodedesc            object
name                   object
population             float32
h_tf                   bool
v_tf                   bool
p_tf                   bool
i_tf                   bool
month                  float32
day                    float32
hour                   float32
dtypes: bool(6), datetime64[ns, UTC](2), float32(12), int32(2), object(9)
memory usage: 1009.5+ MB
None

9.331389427185059

Compute a series of identical “season” attributes based on month from the chicagocrime dataframe using a variety of methods.

 

1) First up, Pandas apply/map with a native Python function call. Note that Python has no “case” statement, but does support a general if/then/elseif/else construct.

In [4]:
start = time.time()

def mkseason(n):
    if 3<=n<=5: season = 'spring'
    elif 6<=n<=8: season = 'summer'
    elif 9<=n<=11: season = 'fall'
    elif n<=2 or n==12: season = 'winter'
    else: season = 'unknown'
    return(season)

chicagocrime['season_1'] = chicagocrime['month'].apply(mkseason)

end = time.time()
print("\n",end - start)
blanks(2)
 1.964857816696167

2) Next, Pandas apply/map invoking a Python lambda function. This approach doesn’t implement elseif directly, but rather through nested else’s.

 

In [5]:
start = time.time()

chicagocrime['season_2'] = chicagocrime.month.map(\
lambda x:'spring' if 3<=x<=5 else \
('summer' if 6<=x<=8 else\
('fall' if 9<=x<=11 else\
('winter' if 1<=x<=2 or x==12 else\
'unknown'))))

end = time.time()
print("\n",end - start)
blanks(2)
 1.9042260646820068

3) Now consider the Numpy where function with nested else’s similar to the above.

 

In [6]:
start = time.time()

chicagocrime['season_3'] =\
np.where((chicagocrime.month>=3) & (chicagocrime.month<=5),'spring',\
np.where((chicagocrime.month>=6) & (chicagocrime.month<=8),'summer',\
np.where((chicagocrime.month>=9) & (chicagocrime.month<=11),'fall',\
np.where((chicagocrime.month==12) | (chicagocrime.month<=2),'winter',\
'unknown'))))

end = time.time()
print("\n",end - start)
blanks(2)
 1.519035816192627

4) Native Pandas. Start with ‘unknown’ and progressively update.

 

In [7]:
start = time.time()

chicagocrime['season_4'] = 'unknown'
chicagocrime.loc[(chicagocrime.month>=3) & (chicagocrime.month<=5),'season_4'] = 'spring' 
chicagocrime.loc[(chicagocrime.month>=6) & (chicagocrime.month<=8),'season_4'] = 'summer' 
chicagocrime.loc[(chicagocrime.month>=9) & (chicagocrime.month<=11),'season_4'] = 'fall' 
chicagocrime.loc[(chicagocrime.month<=2) | (chicagocrime.month==12),'season_4'] = 'winter' 

end = time.time()
print("\n",end - start)
blanks(2)
 2.442988872528076

5) Finally, the Numpy select function. This one implements elseif’s naturally, with a default case to handle “else”.

 

In [8]:
start = time.time()

condlist = [(chicagocrime.month>=3)&(chicagocrime.month<6),            
            (chicagocrime.month>=6)&(chicagocrime.month<=8),
            (chicagocrime.month>=9)&(chicagocrime.month<=11),
            (chicagocrime.month<=2)|(chicagocrime.month==12)
            ]
choicelist = ['spring', 'summer','fall','winter']

chicagocrime['season_5'] = np.select(condlist, choicelist, default='unknown')

end = time.time()
print("\n",end - start)
blanks(2)
 1.2007920742034912

All five season versions equal? Yes.

 

In [9]:
start = time.time()

print(chicagocrime.season_1.equals(chicagocrime.season_2))
print(chicagocrime.season_1.equals(chicagocrime.season_3))
print(chicagocrime.season_1.equals(chicagocrime.season_4))
print(chicagocrime.season_1.equals(chicagocrime.season_5))

end = time.time()
print("\n",end - start)
blanks(2)
True
True
True
True

 2.446195602416992

Show the newly-created season vars in action with frequencies of crime type. Speedy.

 

In [10]:
start = time.time()

vars = ['season_1']
print("homicide\n",freqsdf(chicagocrime[chicagocrime.h_tf][vars],vars))
blanks(1)
vars = ['season_2']
print("violent crime\n",freqsdf(chicagocrime[chicagocrime.v_tf][vars],vars))
blanks(1)
vars = ['season_3']
print("property crime\n",freqsdf(chicagocrime[chicagocrime.p_tf][vars],vars))
blanks(1)
vars = ['season_4']
print("index crime\n",freqsdf(chicagocrime[chicagocrime.i_tf][vars],vars))
blanks(1)
vars = ['season_5']
print("all crime\n",freqsdf(chicagocrime[vars],vars))

end = time.time()
print("\n",end - start)
blanks(2)
homicide
   season_1  frequency    percent
0   summer       3133  31.298701
1     fall       2606  26.033966
2   spring       2359  23.566434
3   winter       1912  19.100899

violent crime
   season_2  frequency    percent
0   summer     176791  28.908911
1     fall     157590  25.769158
2   spring     148949  24.356180
3   winter     128215  20.965751

property crime
   season_3  frequency    percent
0   summer     621618  27.893021
1     fall     580602  26.052565
2   spring     530382  23.799111
3   winter     495977  22.255303

index crime
   season_4  frequency    percent
0   summer     798409  28.111766
1     fall     738192  25.991541
2   spring     679331  23.919061
3   winter     624192  21.977632

all crime
   season_5  frequency    percent
0   summer    1928259  27.323229
1   spring    1785038  25.293802
2     fall    1775968  25.165281
3   winter    1567950  22.217688

 4.422274827957153

How do the five conditional variable creation approaches stack up?

Much as I’d like to recommend 1) or 2) for their functional inclinations, I’m hestitant. While performance is very good when a single attribute, in this case month, is used, it degrades noticeably when multiple attributes are involved in the calculation, as is often the case.

And 3) shares the absence of pure elseif affliction with 2), while 4) seems a bit clunky and awkward.

That leaves 5), the Numpy select, as my choice. It’s simple, handles elseif’s cleanly, and is generally performant, even with multiple attributes — as the silly code below demonstrates.

 

In [11]:
start = time.time()

condlist = [((chicagocrime.season_5=="summer")&(chicagocrime.year.isin([2012,2013,2014,2015]))), 
            ((chicagocrime.season_5=="summer")&(chicagocrime.year.isin([2016,2017,2018,2019])))
            ]
choicelist = ['summerold', 'summerrecent']

chicagocrime['slug'] = np.select(condlist,choicelist,'unknown')

end = time.time()
print("\n",end - start)
blanks(2)
 1.9020872116088867

Lastly, view several sets of frequencies with this newly-created attribute using the Pandas query method.

 

In [12]:
start = time.time()

vars = ['slug']
print("\nhomicides\n",freqsdf(chicagocrime.query('h_tf')[vars],vars))
print("\n property crimes\n",freqsdf(chicagocrime.query('p_tf'),vars))

end = time.time()
print("\n",end - start)
blanks(2)
homicides
            slug  frequency    percent
0       unknown       8649  86.403596
1  summerrecent        783   7.822178
2     summerold        578   5.774226

 property crimes
            slug  frequency    percent
0       unknown    2023420  90.794179
1     summerold     108298   4.859509
2  summerrecent      96861   4.346312

 2.665894031524658

That’s it for now. More on data handling/analysis in Python/Pandas and R/data.table in blogs to come.

In [ ]:

Steve Miller

Steve Miller

Steve is a data scientist focusing on research methods, data integration, statistics/ML, and computational analytics in R/python. Follow him on LinkedIn: https://www.linkedin.com/in/steve-miller-58ab881/

1