# Techniques for Transforming Categorical Data into Numerical Values

Transforming categorical data into numerical values is a common data munging task in data science projects.

The most common approach is to *one-hot encode* the categories, i.e. to add one boolean feature per categorical value.

For instance, a gender feature with three categories (M, F, Unknown) would be replaced by two binary columns, *Male* and *Female*. The third category, Unknown, is inferred when *Male* and *Female* are both False.
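The gender example can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: the helper name `one_hot_encode` and the `COLUMNS` mapping are hypothetical, chosen only for this example.

```python
# Hypothetical mapping from category value to indicator column name.
# "Unknown" deliberately has no column: it is inferred when both are False.
COLUMNS = {"M": "Male", "F": "Female"}


def one_hot_encode(values, columns=COLUMNS):
    """Return one dict of boolean indicator columns per input value."""
    return [
        {name: value == category for category, name in columns.items()}
        for value in values
    ]


genders = ["M", "F", "Unknown", "F"]
encoded = one_hot_encode(genders)

for original, row in zip(genders, encoded):
    print(original, row)
```

In practice a library routine such as `pandas.get_dummies` would usually handle this step. The hand-written version is only meant to make the mechanics visible.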

There are, however, many other more complex techniques.

In this blog post, Will McGinnis, senior architect at Predikto, goes through several of them: Backward Difference, Helmert Contrast, Simple Hashing, and Polynomial Contrast. He also compares the performance of a Scikit-learn BernoulliNB() classifier on several datasets from the UCI dataset repository and shows that some encoding methods perform significantly better than others. A must-read for the Kaggle practitioner.
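To give a flavour of one of these techniques, here is a rough sketch of a simple hashing encoder. It is not taken from the blog post: the function name `hash_encode`, the choice of MD5, and the default of 8 output columns are all assumptions made for illustration.

```python
import hashlib


def hash_encode(value, n_components=8):
    """Map a category string to a fixed-length indicator vector.

    The category is hashed and the hash is reduced modulo `n_components`
    to pick the single column set to 1. Distinct categories may collide
    in the same column, which is the trade-off for a fixed output width.
    """
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    index = int(digest, 16) % n_components
    row = [0] * n_components
    row[index] = 1
    return row


cities = ["Paris", "Atlanta", "Paris"]
hashed = [hash_encode(city) for city in cities]

for city, row in zip(cities, hashed):
    print(city, row)
```

Unlike one-hot encoding, the number of output columns does not grow with the number of categories, which can help with high-cardinality features.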