The Beginner’s Guide to Scikit-Learn
Python, Tools & Languages, Scikit-Learn. Posted by Spencer Norris, ODSC, October 24, 2018
Scikit-Learn is one of the premier tools in the machine learning community, used by academics and industry professionals alike.
At ODSC East 2019, Scikit-Learn author Andreas Mueller will host a training session to give beginners a crash course. As one of the primary contributors to Scikit-Learn, Mueller is one of the most knowledgeable people in the world on the package and among the best to derive insight from.
That’s hard to top, but it helps to know a little about Scikit-Learn before attending the conference. Below is a quick outline of how some of Scikit-Learn’s basic components fit together and how you can begin training models today.
What Are We Trying to Learn?
The most important thing to figure out from the get-go is what we’re actually trying to learn. Do we want to classify our data into different categories? Fit a regression line? Cluster the data? This will drive what model you decide to run with.
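Whichever model you land on, Scikit-Learn estimators all share the same fit/predict interface, so swapping one model for another is cheap. A minimal sketch with made-up toy data (not the wine data used below):

```python
from sklearn.linear_model import Ridge
from sklearn.cluster import KMeans
import numpy as np

# Toy data: a 2-D feature matrix X and a 1-D target y
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1.1, 1.9, 3.2, 3.9])

# Regression: instantiate, fit, predict
reg = Ridge(alpha=1.0).fit(X, y)
preds = reg.predict(X)

# Clustering: same pattern, different estimator
clu = KMeans(n_clusters=2, n_init=10).fit(X)
labels = clu.labels_
```

The uniform API is a big part of why iterating on model choice in Scikit-Learn is so fast.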
For our example, we’ll use a ridge regression model (a regularized version of linear regression) to probe the relationship between wine prices and review scores. I pulled down a scraped collection of Wine Enthusiast reviews, available via Kaggle, and want to determine whether there’s a connection between a wine’s price and the score it received from Wine Enthusiast’s critics. Let’s find out!
The first thing we’ll have to do is pull in the data and drop any rows we’re not interested in working with (e.g. rows with blank values). Then we’ll normalize price and points to a 0-to-1 interval to simplify the training process.
import pandas as pd

#Read in data
data = pd.concat([
    pd.read_csv('data/winemag-data_first150k.csv'),
    pd.read_csv('data/winemag-data-130k-v2.csv')
], ignore_index=True)

#Clean data, pare down
data = data.loc[
    (~data['price'].isnull()) &
    (~data['points'].isnull())
][['price', 'points']]

#Normalize data on 0 to 1 interval
data['price'] = (data['price'] - data['price'].min() + .001) \
    / (data['price'].max() - data['price'].min())
data['points'] = (data['points'] - data['points'].min() + .001) \
    / (data['points'].max() - data['points'].min())
Separating out our data into training and testing sets with Scikit-Learn is super easy. This is all you have to do to get a standard 80-20 training-test split with random selection.
from sklearn.model_selection import train_test_split

#Train-test split
training, test = train_test_split(data, train_size=.8, shuffle=True)
Now all that’s left to do is train and evaluate our model.
from sklearn.linear_model import RidgeCV
import numpy as np

#Instantiate, train model; alphas must be positive floats
model = RidgeCV(alphas=np.arange(.2, 10, .2), cv=10)
model.fit(np.vstack(training['price'].values),
          np.vstack(training['points'].values))
How did we perform?
print(model.score(np.vstack(test['price']), np.vstack(test['points'])))

0.200493125822
Horribly! An R-squared of about .20 means price explains only a fifth of the variance in score, which isn’t something we should deploy in the wild.
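One consolation of using RidgeCV is that we can at least inspect which regularization strength cross-validation selected, via the fitted `alpha_` attribute. A self-contained sketch, using synthetic data as a stand-in for the wine reviews:

```python
from sklearn.linear_model import RidgeCV
import numpy as np

# Synthetic stand-in: a noisy linear relationship on the 0-1 interval
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))
y = 0.5 * X.ravel() + rng.normal(scale=0.05, size=200)

# RidgeCV tries each positive alpha and keeps the best by cross-validation
model = RidgeCV(alphas=np.arange(.2, 10, .2), cv=10).fit(X, y)
print(model.alpha_)      # the regularization strength CV selected
print(model.score(X, y)) # R-squared on this synthetic data
```

A large selected alpha suggests the model is leaning hard on regularization; a small one suggests the plain linear fit is already about as good as it gets.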
The good news is, that’s all there is to it! We managed to train a regularized model with cross-validation and a proper train-test split in about 35 lines of Python. Here’s the full script.
#!/usr/bin/env python3
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

def main():
    #Read in data
    data = pd.concat([
        pd.read_csv('data/winemag-data_first150k.csv'),
        pd.read_csv('data/winemag-data-130k-v2.csv')
    ], ignore_index=True)

    #Clean data, pare down
    data = data.loc[
        (~data['price'].isnull()) &
        (~data['points'].isnull())
    ][['price', 'points']]

    #Normalize data on 0 to 1 interval
    data['price'] = (data['price'] - data['price'].min() + .001) \
        / (data['price'].max() - data['price'].min())
    data['points'] = (data['points'] - data['points'].min() + .001) \
        / (data['points'].max() - data['points'].min())

    #Train-test split
    training, test = train_test_split(data, train_size=.8, shuffle=True)

    #Instantiate, train model; alphas must be positive floats
    model = RidgeCV(alphas=np.arange(.2, 10, .2), cv=10)
    model.fit(np.vstack(training['price'].values),
              np.vstack(training['points'].values))

    print(model.score(np.vstack(test['price']), np.vstack(test['points'])))

if __name__ == '__main__':
    main()
Not too shabby. And that means that we can iterate quickly and rebuild our model to tune performance as needed. Maybe next time we’d try ridge regression with a nonlinear kernel, or try and see if there are clusters according to price point.
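Ridge regression with a nonlinear kernel is available in Scikit-Learn as `KernelRidge`. Here is a hedged sketch on synthetic data (the data-generating function and hyperparameters are illustrative choices, not tuned values) showing an RBF-kernel model fitting a curve a linear model could not:

```python
from sklearn.kernel_ridge import KernelRidge
import numpy as np

# Synthetic nonlinear target: a sine curve plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))
y = np.sin(4 * X.ravel()) + rng.normal(scale=0.05, size=200)

# RBF-kernel ridge regression; alpha and gamma are illustrative, not tuned
model = KernelRidge(kernel='rbf', alpha=0.1, gamma=5.0).fit(X, y)
print(model.score(X, y))  # R-squared on the training data
```

If the wine data has a nonlinear price-score relationship, a model like this could capture it where plain ridge regression plateaus, though the hyperparameters would need their own cross-validation.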
Scikit-Learn is a great way for professionals to accelerate their development process and for rookies to take a stab at machine learning. And at ODSC East, you can learn straight from one of the creators of the package himself.