Transforming Skewed Data for Machine Learning

Skewed data is common in data science; skew is the degree of distortion from a normal distribution. For example, below is a plot of the house prices from Kaggle's House Prices Competition; the distribution is right skewed, meaning a minority of the values are very large.

Why do we care whether the data is skewed? If the response variable is skewed, as in Kaggle's House Prices Competition, the model will be trained on a much larger number of moderately priced homes and will be less likely to successfully predict the price of the most expensive houses. The concept is the same as training a model on imbalanced categorical classes. If the values of a certain independent variable (feature) are skewed, then depending on the model, skewness may violate model assumptions (e.g. the normally distributed residuals assumed by linear regression) or may impair the interpretation of feature importance.


We can objectively test whether the variable deviates from a normal distribution using the Shapiro–Wilk test. The null hypothesis for this test is that the data is a sample from a normal distribution, so a p-value less than 0.05 indicates a significant departure from normality (here, due to skew). We'll apply the test to the response variable Sale Price above, labeled "resp", using scipy.stats in Python.
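A minimal sketch of that test, assuming resp is a pandas Series (or NumPy array) holding the SalePrice values:

from scipy.stats import shapiro

# Shapiro–Wilk test: returns the test statistic W and the p-value
stat, p = shapiro(resp)
print(f"W = {stat:.4f}, p-value = {p:.3g}")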

The p-value is, unsurprisingly, less than 0.05, so we can conclude that the variable is not normally distributed. A more convenient way of evaluating skewness is with pandas' ".skew" method, which calculates the adjusted Fisher–Pearson standardized moment coefficient. We can calculate it for all the features in the House Prices dataset (labeled "df") simultaneously with the following code.
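A sketch of that calculation, assuming df is the DataFrame of features:

# Skewness of every numeric column, sorted most skewed first
df.skew(numeric_only=True).sort_values(ascending=False)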


A few of the variables, like Pool Area, are highly right skewed because most of their values are zero; this is okay, since some models, like decision trees, are fairly robust to skewed features.

We can address skewed variables by transforming them (i.e. applying the same function to each value). Common transformations include the square root (sqrt(x)), the logarithm (log(x)), and the reciprocal (1/x). We'll apply each in Python to the right-skewed response variable Sale Price.

Square Root Transformation
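A minimal sketch, assuming resp is a pandas Series of the sale prices (plotting requires matplotlib):

import numpy as np

sqrt_resp = np.sqrt(resp)      # square root of each sale price
sqrt_resp.plot.hist(bins=50)   # histogram of the transformed values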

After transforming, the data is definitely less skewed, but there is still a long right tail.

Reciprocal Transformation
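The reciprocal works the same way; sale prices are strictly positive, so dividing is safe:

recip_resp = 1 / resp          # reciprocal of each sale price
recip_resp.plot.hist(bins=50)  # histogram of the transformed values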

Still not great; the above distribution is not quite symmetrical.

Log Transformation
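And the natural log, again assuming strictly positive values:

import numpy as np

log_resp = np.log(resp)        # natural log of each sale price
log_resp.plot.hist(bins=50)    # histogram of the transformed values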

The log transformation seems to be the best, as the distribution of transformed sale prices is the most symmetrical.

Box Cox Transformation

An alternative to manually trying a variety of transformations is the Box Cox transformation. For each variable, the Box Cox procedure estimates the value of lambda between -5 and 5 that maximizes the normality of the transformed data, using the equation below.

y(lambda) = (y^lambda - 1) / lambda,  if lambda ≠ 0
y(lambda) = log(y),                   if lambda = 0

For negative values of lambda, the transformation performs a variant of the reciprocal of the variable. At a lambda of zero, the variable is log transformed, and for positive values of lambda, the variable is raised to the power of lambda. We can apply "boxcox" to all the skewed variables in the dataframe "df" using scipy.stats.
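A sketch of that loop; skewed_cols is a hypothetical list of the skewed column names, and boxcox requires strictly positive input, so columns containing zeros (like Pool Area) need a shift such as +1 first:

from scipy.stats import boxcox

for col in skewed_cols:                 # hypothetical list of skewed columns
    df[col], lam = boxcox(df[col] + 1)  # +1 shift keeps zero-valued entries positive
    print(f"{col}: fitted lambda = {lam:.3f}")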

Skewness is reduced quite a bit! The Box Cox transformation is not a panacea for skew, however; some variables cannot be transformed to be normally distributed.

Transforming skewed data is one critical step in the data cleaning process. See this article to learn about dealing with imbalanced categorical classes.


Nathaniel Jermain

Nathaniel builds and implements predictive models for a fish research lab at the University of Southern Mississippi. His work informs the management of marine resources in applications across the United States. Connect with Nathaniel on LinkedIn: linkedin.com/in/njermain/
