The Beginner’s Guide to Scikit-Learn

Scikit-Learn is one of the premier tools in the machine learning community, used by academics and industry professionals alike. At ODSC East 2019, Scikit-Learn author Andreas Mueller will host a training session to give beginners a crash course, and this article is your guide to Scikit-Learn in the meantime. As one of the primary contributors to Scikit-Learn, Mueller is one of the most knowledgeable people in the world on the package and among the best to learn from.

[Related Article: Watch: Introduction to Machine Learning with Scikit-Learn]

That’s hard to top, but it helps to know a little about Scikit-Learn before attending the conference. Below is a quick outline of how some of Scikit-Learn’s basic components fit together and how you can begin training models today.

What Are We Trying to Learn?

The most important thing to figure out from the get-go is what we’re actually trying to learn. Do we want to classify our data into different categories? Fit a regression line? Cluster the data? This will drive which model we decide to run with.
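Whichever task we pick, Scikit-Learn exposes the same basic interface: instantiate an estimator, call fit, then call predict. Here’s a minimal sketch on toy data (the arrays below are made up purely for illustration, and any other estimator would slot into the same pattern):

from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
import numpy as np

#Toy data, purely for illustration
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

#Classification: fit on labeled data, then predict
clf = LogisticRegression().fit(X, y)
print(clf.predict([[2.5], [11.5]]))

#Clustering: same fit pattern, no labels needed
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)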

For our example, to gain insight into the relationship between wine prices and reviews, we’ll use a ridge regression model: a regularized version of linear regression that penalizes large coefficients to guard against overfitting. I pulled down a scraped collection of reviews from Wine Enthusiast, available via Kaggle, and I hope to determine whether there is a connection between the price of a wine and the score it received from Wine Enthusiast’s critics. Let’s find out!

The first thing we’ll have to do is pull in the data and clean out any rows we’re not interested in working with (e.g. those with blank values). Then we’ll normalize our data to a 0 to 1 interval to simplify the training process.


import pandas as pd

#Read in data
data = pd.concat([
    pd.read_csv('data/winemag-data_first150k.csv'),
    pd.read_csv('data/winemag-data-130k-v2.csv')
], ignore_index=True)

#Clean data, pare down to the two columns we need
data = data.loc[
    (~data['price'].isnull()) &
    (~data['points'].isnull()),
    ['price', 'points']
]

#Normalize data on a 0 to 1 interval
#(the outer parentheses keep the division in the same expression;
# the small offset keeps the minimum value from being exactly zero)
data['price'] = ((data['price'] - data['price'].min() + .001)
                 / (data['price'].max() - data['price'].min()))
data['points'] = ((data['points'] - data['points'].min() + .001)
                  / (data['points'].max() - data['points'].min()))


Separating out our data into training and testing sets with Scikit-Learn is super easy. This is all you have to do to get a standard 80-20 training-test split with random selection.


from sklearn.model_selection import train_test_split
#Train-test split
training, test = train_test_split(data, train_size=.8, shuffle=True)
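One note: with shuffle=True the split is random, so your numbers will differ slightly from run to run. If you need a reproducible split, train_test_split also accepts a random_state seed (the value 42 below is arbitrary):

from sklearn.model_selection import train_test_split
#Fixing random_state makes the same split come out every run
training, test = train_test_split(data, train_size=.8, shuffle=True, random_state=42)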


Now all that’s left to do is train and evaluate our model.


from sklearn.linear_model import RidgeCV
import numpy as np

#Instantiate, train model; RidgeCV picks the best alpha by 10-fold cross-validation
#(the grid starts at .2 because RidgeCV requires strictly positive alphas)
model = RidgeCV(alphas=np.arange(.2, 10, .2), cv=10)
model.fit(training['price'].values.reshape(-1, 1),
          training['points'].values.reshape(-1, 1))


How did we perform?


print(model.score(test['price'].values.reshape(-1, 1),
                  test['points'].values.reshape(-1, 1)))
0.200493125822


Horribly! An R-squared value of .20 means price explains only about 20 percent of the variance in score, which is poor performance and isn’t something we should deploy in the wild.
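By default, a regressor’s score method reports R-squared. If we wanted a fuller picture, sklearn.metrics offers standalone functions; here’s a quick sketch comparing R-squared with mean squared error on the same test set, reusing the model and test variables from above:

from sklearn.metrics import mean_squared_error, r2_score

predictions = model.predict(test['price'].values.reshape(-1, 1))
y_true = test['points'].values.reshape(-1, 1)
print(r2_score(y_true, predictions))           #same value as model.score above
print(mean_squared_error(y_true, predictions)) #average squared error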

The good news is, that’s all it takes! We managed to train a regularized model with cross-validation and a proper train-test split in about 35 lines of Python. Here’s the full script.


#!/usr/bin/env python3
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

def main():

    #Read in data
    data = pd.concat([
        pd.read_csv('data/winemag-data_first150k.csv'),
        pd.read_csv('data/winemag-data-130k-v2.csv')
    ], ignore_index=True)

    #Clean data, pare down to the two columns we need
    data = data.loc[
        (~data['price'].isnull()) &
        (~data['points'].isnull()),
        ['price', 'points']
    ]

    #Normalize data on a 0 to 1 interval
    data['price'] = ((data['price'] - data['price'].min() + .001)
                     / (data['price'].max() - data['price'].min()))
    data['points'] = ((data['points'] - data['points'].min() + .001)
                      / (data['points'].max() - data['points'].min()))

    #Train-test split
    training, test = train_test_split(data, train_size=.8, shuffle=True)

    #Instantiate, train model (alphas must be strictly positive)
    model = RidgeCV(alphas=np.arange(.2, 10, .2), cv=10)
    model.fit(training['price'].values.reshape(-1, 1),
              training['points'].values.reshape(-1, 1))

    print(model.score(test['price'].values.reshape(-1, 1),
                      test['points'].values.reshape(-1, 1)))

if __name__ == '__main__':
    main()

Not too shabby. And that means we can iterate quickly and rebuild our model to tune performance as needed. Maybe next time we’d try ridge regression with a nonlinear kernel, or see if there are clusters according to price point.
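To make that concrete, here’s a rough sketch of what those next iterations might look like, using KernelRidge for the nonlinear case and KMeans for clustering. The alpha, gamma, n_clusters, and sample-size values here are illustrative guesses, not tuned choices:

from sklearn.kernel_ridge import KernelRidge
from sklearn.cluster import KMeans

#Kernel ridge builds an n-by-n kernel matrix, so subsample on data this large
sample = training.sample(5000, random_state=0)
kernel_model = KernelRidge(kernel='rbf', alpha=1.0, gamma=.1)
kernel_model.fit(sample['price'].values.reshape(-1, 1),
                 sample['points'].values.reshape(-1, 1))
print(kernel_model.score(test['price'].values.reshape(-1, 1),
                         test['points'].values.reshape(-1, 1)))

#Cluster the wines by price point
km = KMeans(n_clusters=3, n_init=10)
labels = km.fit_predict(data['price'].values.reshape(-1, 1))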

[Related Article: Exploring Scikit-Learn Further: The Bells and Whistles of Preprocessing]

Scikit-Learn is a great way for professionals to accelerate their development process and for rookies to take a stab at machine learning. Use this guide to Scikit-Learn, and at ODSC East you can learn straight from one of the creators of the package himself.

Spencer Norris, ODSC

Spencer Norris is a data scientist and freelance journalist. He currently works as a contractor and publishes on his blog on Medium: https://medium.com/@spencernorris
