Convert Pandas Categorical Data for SciKit-Learn

As you work with different datasets, you will come across categorical data. Some analysts simply discard this data or leave it out of their models. That is certainly an option, but categorical data often carries information that we would want those models to use.

Examples of values which may be represented in a categorical way:

  • Political party: Democratic, Republican, Independent
  • Religious affiliation: Christianity, Hinduism, Buddhism
  • Retail departments: shoes, apparel, home goods
  • Property styles: Bungalow, Bi-level, 2-story

While several algorithms can handle categorical and numerical values together with virtually no pre-processing, others require your categorical data to be converted to numerical values first.

Title: Convert Pandas Categorical Data For SciKit-Learn
Author: Damian Mingle
Date: 06/08/2018

Preliminaries

# Bring in libraries
from sklearn import preprocessing
import pandas as pd

Construct a DataFrame

# Create the data
raw_data = {'clinical_trial': [1, 2, 1, 2, 2],
            'observation': [1, 2, 3, 1, 1],
            'protocol': [0, 1, 0, 1, 0],
            'outcome': ['excellent', 'poor', 'normal', 'poor', 'excellent']}

# Fill the DataFrame
df = pd.DataFrame(raw_data, columns = ['clinical_trial', 'observation', 'protocol', 'outcome'])
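
Before encoding anything, it can help to confirm which columns actually hold non-numeric values. The snippet below is a minimal sketch that rebuilds the same DataFrame and lists the columns pandas does not treat as numeric; the variable name `categorical_columns` is only illustrative.

```python
import pandas as pd

# Same data as above, rebuilt so this snippet runs on its own
df = pd.DataFrame({
    'clinical_trial': [1, 2, 1, 2, 2],
    'observation': [1, 2, 3, 1, 1],
    'protocol': [0, 1, 0, 1, 0],
    'outcome': ['excellent', 'poor', 'normal', 'poor', 'excellent'],
})

# Columns that are not numeric are the candidates for encoding
categorical_columns = df.select_dtypes(exclude='number').columns.tolist()
print(categorical_columns)
```

Here only `outcome` should be reported, since the other three columns contain integers.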

Fit The Label Encoder

# Create a label encoder object 
le = preprocessing.LabelEncoder()

# Fit the encoder object (le) to a pandas field with categorical data
le.fit(df['outcome'])
LabelEncoder()
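
Fitting fixes the set of labels the encoder knows about. As a hedged illustration, the sketch below shows what happens when a value that was not present during fitting is transformed later; the label 'unknown' and the flag `unseen_ok` are made up for this example.

```python
from sklearn import preprocessing

# Fit on the same outcome values used in this post
le = preprocessing.LabelEncoder()
le.fit(['excellent', 'poor', 'normal', 'poor', 'excellent'])

# 'unknown' was never seen during fit, so transform raises a ValueError
try:
    le.transform(['unknown'])
    unseen_ok = True
except ValueError as err:
    unseen_ok = False
    print(f"Could not encode: {err}")
```

In practice this means new data should be checked or cleaned against the fitted labels before it is transformed.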

View The Labels

# Display labels
list(le.classes_)
['excellent', 'normal', 'poor']
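
The labels are stored in sorted order, and each label's integer code is its position in `classes_`. A small sketch of building a lookup from that, with `label_to_code` as an illustrative name:

```python
from sklearn import preprocessing

# Fit on the same outcome values used in this post
le = preprocessing.LabelEncoder()
le.fit(['excellent', 'poor', 'normal', 'poor', 'excellent'])

# Each label's code is its index in the sorted classes_ array
label_to_code = {str(label): code for code, label in enumerate(le.classes_)}
print(label_to_code)
```

A dictionary like this can be handy for documenting the encoding alongside a saved model.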

Transform Categories Into Integers

# Apply the label encoder object to a pandas column
le.transform(df['outcome'])
array([0, 2, 1, 2, 0], dtype=int64)
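
The transformed array is not written back to the DataFrame automatically. One possible way to keep the codes next to the original values is sketched below; the column name `outcome_code` is an assumption for this example, and `fit_transform` simply combines the fit and transform steps shown above.

```python
import pandas as pd
from sklearn import preprocessing

# Same data as above, rebuilt so this snippet runs on its own
df = pd.DataFrame({
    'clinical_trial': [1, 2, 1, 2, 2],
    'observation': [1, 2, 3, 1, 1],
    'protocol': [0, 1, 0, 1, 0],
    'outcome': ['excellent', 'poor', 'normal', 'poor', 'excellent'],
})

le = preprocessing.LabelEncoder()

# Fit and transform in one call, storing the codes in a new column
df['outcome_code'] = le.fit_transform(df['outcome'])
print(df[['outcome', 'outcome_code']])
```

Keeping the original column makes it easy to verify the encoding by eye.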

Transform Integers Into Categories

# Reverse numerical values into categorical names
list(le.inverse_transform([2, 0, 2]))
['poor', 'excellent', 'poor']
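
Note that the integers above follow alphabetical order, not any ranking of the outcomes. If the order matters, one alternative (not the approach used in this post) is an ordered pandas Categorical. The sketch below assumes the ranking poor < normal < excellent, which is our reading of the labels rather than something stated in the data.

```python
import pandas as pd

outcome = pd.Series(['excellent', 'poor', 'normal', 'poor', 'excellent'])

# Assumed ranking, lowest to highest: poor < normal < excellent
ordered = pd.Categorical(outcome,
                         categories=['poor', 'normal', 'excellent'],
                         ordered=True)

# codes gives each value's position in the declared category order
ordinal_codes = ordered.codes.tolist()
print(ordinal_codes)
```

With this ordering, 'poor' maps to 0 and 'excellent' to 2, so larger codes correspond to better outcomes.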



Damian Mingle

Damian Mingle is an American businessman, investor, and data scientist. He is the Founder and Chief Data Scientist of LoveToThink.org, a way for skilled professionals to contribute their expertise and empower the world’s social changemakers. Formerly, Damian was the Chief Data Scientist at Intermedix (an R1 company) where he was responsible for leading a team of international data scientists to drive business value. As a leading authority on data science, Damian speaks nationally and internationally on patient safety, global health, and applied data science.
