Exploring Scikit-Learn Further: The Bells and Whistles of Preprocessing
Posted by Spencer Norris, ODSC, October 25, 2018
In my previous post, we constructed a simple cross-validated regression model using Scikit-Learn in 35 lines.
It’s pretty amazing that we can perform machine learning with so little effort, but we just did the bare minimum in order to get a working model. Frankly, it didn’t even perform that well. What else can we do with Scikit-Learn to help take our work to the next level?
Andreas Mueller, one of the primary contributors to Scikit-Learn, will go into more detail on his tool and walk you through some of the finer points of the package at ODSC West 2018. In the meantime, let’s whet our appetites and explore how Scikit-Learn allows us to clean and preprocess our data before we build our model.
Clean Up Your Data
Scikit-Learn is an end-to-end solution for many of the most common machine learning problems. That includes how we clean our data prior to training.
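Data normalization is one of the most common preprocessing steps for getting data into a form that algorithms can use. This was always a manual step in my workflow: I would write a few long, messy lines and helper functions to manipulate a data frame in Pandas and get my features on a scale of 0 to 1. Then I happened across this example in the Scikit-Learn documentation. Note that normalize does something slightly different from 0-to-1 feature scaling: it rescales each sample (row) to unit length, here under the L2 norm. For scaling each feature to the 0-to-1 range, Scikit-Learn provides MinMaxScaler.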
>>> from sklearn import preprocessing
>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]
>>> X_normalized = preprocessing.normalize(X, norm='l2')
>>> X_normalized
array([[ 0.40..., -0.40...,  0.81...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 0.  ...,  0.70..., -0.70...]])
Maybe that won’t knock your socks off, but it’s game-changing that this tedious process is completely automated in Scikit-Learn.
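To make the row-versus-column distinction concrete, here is a minimal sketch on the same toy matrix. It contrasts normalize, which works on each sample, with MinMaxScaler, which works on each feature. The variable names (row_norms, col_min, col_max) are just illustrative choices, not anything from the documentation.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# normalize() works row by row: each sample is divided by its own L2 norm,
# so every row of the result has length 1.
X_normalized = preprocessing.normalize(X, norm='l2')
row_norms = np.linalg.norm(X_normalized, axis=1)

# MinMaxScaler works column by column: each feature is mapped onto [0, 1]
# using that column's minimum and maximum.
X_minmax = MinMaxScaler().fit_transform(X)
col_min = X_minmax.min(axis=0)
col_max = X_minmax.max(axis=0)

print(row_norms)
print(col_min, col_max)
```

Which one you want depends on the model: unit-length rows are common for text or similarity-based methods, while per-feature scaling is the closer match to "features on a scale of 0 to 1."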
Data standardization is another key stepping stone in data preprocessing that is applicable to many machine learning algorithms. According to the Scikit-Learn documentation, this is also a cinch:
>>> from sklearn import preprocessing
>>> import numpy as np
>>> X_train = np.array([[ 1., -1.,  2.],
...                     [ 2.,  0.,  0.],
...                     [ 0.,  1., -1.]])
>>> X_scaled = preprocessing.scale(X_train)
>>> X_scaled
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])
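Not too bad! Each column now has zero mean and unit variance, which makes our data usable with models like the RBF kernel support vector machine. According to the Scikit-Learn documentation, the RBF kernel assumes that all features are centered around zero and have variance of the same order, and standardization is what gives us that.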
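In practice you usually want to learn the scaling from your training data and reuse it on data the model has not seen. The sketch below uses StandardScaler for that, which applies the same per-column centering and scaling as preprocessing.scale but remembers the fitted mean and standard deviation. X_new is a hypothetical unseen sample added purely for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
X_new = np.array([[-1., 1., 0.]])  # hypothetical unseen sample

# Learn each column's mean and standard deviation from the training data only.
scaler = StandardScaler().fit(X_train)

# Apply the same learned transformation to both training and new data.
X_train_scaled = scaler.transform(X_train)
X_new_scaled = scaler.transform(X_new)

# After scaling, every training column has mean 0 and standard deviation 1.
train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)

print(train_means, train_stds)
print(X_new_scaled)
```

Fitting on the training set alone keeps information about the test data from leaking into the preprocessing step.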
Making Your Data ‘Smaller’
A common problem with machine learning in practice is high dimensionality in your dataset. When we’re measuring lots of different things, how do we know which features are most important? Which columns are going to tell us the most about our data?
An extreme example of this is at CERN, where scientists reportedly collect a gigabit per second of a huge variety of different measures. In other words, their data isn’t just ‘tall’ (lots of examples), it’s also very, very ‘wide’ (lots of columns).
To cope with this problem, we can use dimensionality reduction techniques to make the data more manageable for our models. Scikit-Learn gives us the tools to do that both simply and efficiently.
(And, fun fact, CERN actually uses Scikit-Learn in some of their software.)
One of the most common methods for dimensionality reduction is principal component analysis (PCA), which I discussed in a previous post. We don’t have the time to write PCA from scratch, though, so why not let Scikit-Learn do the heavy lifting?
From the documentation:
>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> pca = PCA(n_components=2)
>>> pca.fit(X)
PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)
>>> print(pca.explained_variance_ratio_)
[0.9924... 0.0075...]
>>> print(pca.singular_values_)
[6.30061... 0.54980...]
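Amazing. With just two lines of actual code (constructing the PCA object and calling fit), we discover new features that describe the data in a disciplined way. This particular example keeps both components, so nothing is dropped yet, but the explained variance ratio shows that the first component alone accounts for about 99% of the variance. Setting n_components=1 would cut the number of columns in half while describing the data almost as well.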
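As a sketch of what dropping a column actually looks like, the snippet below reruns the documentation's toy data with n_components=1 and projects it down to a single feature. The names X_reduced, X_restored, and retained are illustrative choices, not part of the original example.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

# Keep only the first principal component: two columns become one.
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)

# Map the reduced data back to the original two-column space to see
# roughly what was kept; the result is an approximation of X.
X_restored = pca.inverse_transform(X_reduced)

# Fraction of the total variance captured by the component we kept.
retained = pca.explained_variance_ratio_.sum()

print(X_reduced.shape, X_restored.shape)
print(retained)
```

On real data the right number of components is a judgment call; a common approach is to look at the cumulative explained variance ratio and keep enough components to reach a threshold you are comfortable with.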
Scikit-Learn has all kinds of amazing interfaces like this for building models and developing your pipeline in no time. Be sure to check out Mueller’s talk at ODSC West 2018 for a tour of his amazing tool.