Techniques for Transforming Categorical Data into Numerical Values
Transforming categorical data into numerical values is a common data munging task in data science projects.
The most common approach is to one-hot encode the categories, i.e. to replace the categorical feature with a boolean feature for each of its values.
For instance, a gender feature with three categories (M, F, Unknown) would be replaced by two binary columns, Male and Female; a third column is unnecessary, since Unknown can be inferred when Male and Female are both False.
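As a rough illustration of the gender example, here is a minimal pure-Python sketch. The helper name `one_hot` and its `drop` parameter are hypothetical, chosen for this example only; in practice a library routine such as pandas `get_dummies` would normally do this job.

```python
def one_hot(values, categories, drop=None):
    """Encode each value as a dict of boolean columns, one per kept category.

    `drop` names the category that is left out because it can be inferred
    when every other column is False (hypothetical parameter for this sketch).
    """
    kept = [c for c in categories if c != drop]
    return [{c: (v == c) for c in kept} for v in values]


genders = ["M", "F", "Unknown", "F"]

# Two boolean columns (M and F); "Unknown" is implied by both being False.
rows = one_hot(genders, categories=["M", "F", "Unknown"], drop="Unknown")

for original, encoded in zip(genders, rows):
    print(original, encoded)
```

Dropping one column keeps the encoded features from being perfectly redundant, at the cost of making the dropped category implicit.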
There are, however, many other, more complex techniques.
In this blog post, Will McGinnis, senior architect at Predikto, goes through several of them: Backward Difference, Helmert Contrast, Simple Hashing, and Polynomial Contrast. He also compares the performance of a scikit-learn BernoulliNB() classifier on several datasets from the UCI Machine Learning Repository and shows that some methods perform significantly better than others. A must-read for the Kaggle practitioner.
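To give a feel for one of the techniques named above, here is a minimal sketch of the general idea behind hashing-based encoding. It is not taken from the blog post: the function name `hash_encode`, the choice of MD5, and the default of 8 output columns are all assumptions made for illustration.

```python
import hashlib


def hash_encode(value, n_components=8):
    """Map a category to a fixed-width binary row via a hash of its text.

    Assumptions for this sketch: MD5 as the hash function and 8 output
    columns. Different categories can collide on the same column.
    """
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    index = int(digest, 16) % n_components  # column selected by the hash
    row = [0] * n_components
    row[index] = 1
    return row


encoded = {g: hash_encode(g) for g in ["M", "F", "Unknown"]}
for category, row in encoded.items():
    print(category, row)
```

Unlike one-hot encoding, the number of output columns is fixed in advance and does not grow with the number of categories, which is why collisions are possible.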