

Convert Pandas Categorical Data for SciKit-Learn
Posted by Damian Mingle, July 25, 2018

As you work with different data sets, you will come across categorical data. Some people simply discard this data in their analysis or leave it out of their models. That is certainly an option; however, categorical data often represents information that we would want to include in the analysis or model.
Examples of values which may be represented in a categorical way:
- Political party: Democratic, Republican, Independent
- Religious affiliation: Christianity, Hinduism, Buddhism
- Retail departments: shoes, apparel, home goods
- Property styles: Bungalow, Bi-level, 2-story
While several algorithms can automatically handle categorical and numerical values with virtually no pre-processing, other algorithms require your categorical data to be converted to numerical values.
title | author | date |
---|---|---|
Convert Pandas Categorical Data For SciKit-Learn | Damian Mingle | 06/08/2018 |
Preliminaries
# Bring in libraries
from sklearn import preprocessing
import pandas as pd
Construct a DataFrame
# Create the data
raw_data = {'clinical_trial': [1, 2, 1, 2, 2],
'observation': [1, 2, 3, 1, 1],
'protocol': [0, 1, 0, 1, 0],
'outcome': ['excellent', 'poor', 'normal', 'poor', 'excellent']}
# Fill the DataFrame
df = pd.DataFrame(raw_data, columns = ['clinical_trial', 'observation', 'protocol', 'outcome'])
Fit The Label Encoder
# Create a label encoder object
le = preprocessing.LabelEncoder()
# Fit the encoder object (le) to a pandas field with categorical data
le.fit(df['outcome'])
LabelEncoder()
View The Labels
# Display labels
list(le.classes_)
['excellent', 'normal', 'poor']
Transform Categories Into Integers
# Apply the label encoder object to a pandas column
le.transform(df['outcome'])
array([0, 2, 1, 2, 0], dtype=int64)
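To make the mapping less of a black box, here is a minimal pure-Python sketch of what the encoding amounts to for this column. It is an illustration only, not scikit-learn's implementation; the names `classes`, `class_to_index`, `encoded`, and `decoded` are hypothetical and chosen for this example. It assumes the classes are ordered by sorting the unique values, which matches the labels displayed above.

```python
# Illustrative sketch only; this is not how scikit-learn is implemented internally.
outcomes = ['excellent', 'poor', 'normal', 'poor', 'excellent']

# The sorted unique values play the role of le.classes_
classes = sorted(set(outcomes))

# Map each class name to its position in the sorted list
class_to_index = {name: index for index, name in enumerate(classes)}

# Rough equivalent of le.transform: look up the integer for each value
encoded = [class_to_index[name] for name in outcomes]

# Rough equivalent of le.inverse_transform: look up the name for each integer
decoded = [classes[index] for index in encoded]

print(classes)
print(encoded)
print(decoded)
```

The integers carry no meaning beyond identifying a class; here 'normal' receives 1 and 'poor' receives 2 only because of alphabetical order.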
Transform Integers Into Categories
# Reverse numerical values into categorical names
list(le.inverse_transform([2, 0, 2]))
['poor', 'excellent', 'poor']
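In practice you will usually want the encoded values stored next to the original column. The sketch below shows one way to do that with `fit_transform`, which fits the encoder and encodes the column in a single call. The column name `outcome_encoded` is an assumption made for this example and is not part of the original walkthrough.

```python
import pandas as pd
from sklearn import preprocessing

# Rebuild only the categorical column from the example above
df = pd.DataFrame({'outcome': ['excellent', 'poor', 'normal', 'poor', 'excellent']})

# Create the encoder, then fit and transform in one step
le = preprocessing.LabelEncoder()

# 'outcome_encoded' is a hypothetical column name used for illustration
df['outcome_encoded'] = le.fit_transform(df['outcome'])

print(df)
```

Keeping the fitted `le` object around lets you call `le.inverse_transform` later to recover the original category names.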