### 5. Putting it all to...

5.1. Pipelining. We have seen that some estimators can transform data, and some estimators can predict variables. We can create combined estimators:

    >>> from scikits.learn import linear_model, decomposition, datasets
    >>> logistic = linear_model.LogisticRegression()
    >>> pca = decomposition.PCA()
    >>> from scikits.learn.pipeline import Pipeline
    >>> pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])
    >>> digits = datasets.load_digits()
    >>> X_digits […]
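
The idea of chaining a transformer and a predictor can be sketched in plain Python. This is a toy illustration only, not scikit-learn's `Pipeline`: the classes `Scaler`, `SignClassifier`, and `ToyPipeline` are hypothetical names invented for this example.

```python
class Scaler:
    """Toy transformer (hypothetical): centers each feature on its training mean."""

    def fit(self, X, y=None):
        n = len(X)
        self.means_ = [sum(col) / n for col in zip(*X)]
        return self

    def transform(self, X):
        return [[v - m for v, m in zip(row, self.means_)] for row in X]


class SignClassifier:
    """Toy predictor (hypothetical): predicts 1 if the first feature is positive."""

    def fit(self, X, y):
        return self  # nothing to learn in this toy model

    def predict(self, X):
        return [1 if row[0] > 0 else 0 for row in X]


class ToyPipeline:
    """Chains (name, step) pairs: every step but the last transforms the data."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        for _, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        for _, step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1][1].predict(X)


X = [[1.0], [2.0], [8.0], [9.0]]
y = [0, 0, 1, 1]
pipe = ToyPipeline(steps=[("scale", Scaler()), ("clf", SignClassifier())])
predictions = pipe.fit(X, y).predict(X)
print(predictions)
```

The combined object exposes the same `fit` and `predict` interface as a single estimator, which is what lets it be dropped into the rest of a workflow unchanged.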

### Beyond One-hot: an E...

In machine learning, data is king. The algorithms and models used to make predictions with the data are important, and very interesting, but ML is still subject to the idea of garbage-in-garbage-out. With that in mind, let’s look at a small subset of that input data: categorical variables. Categorical variables (wiki) are those that represent a […]

### 4. Unsupervised Lear...

4.1. Clustering: grouping observations together. The problem solved in clustering: given the iris dataset, if we knew that there were 3 types of iris, but did not have access to a taxonomist to label them, we could try a clustering task: split the observations into well-separated groups called clusters. 4.1.1. K-means clustering. Note that there exist many […]
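
As a rough illustration of what a clustering task does, here is a minimal plain-Python sketch of K-means (Lloyd's algorithm). It is not scikit-learn's implementation; the function names, the sample points, and the fixed starting centers are all assumptions made for this example.

```python
def assign(points, centers):
    """Label each point with the index of its nearest center (squared distance)."""
    def sq_dist(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))
    return [min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
            for p in points]


def kmeans(points, centers, n_iter=10):
    """Alternate between assigning points and moving centers to cluster means."""
    for _ in range(n_iter):
        labels = assign(points, centers)
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(old)  # keep an empty cluster's center in place
        centers = new_centers
    return assign(points, centers), centers


points = [[1.0, 1.0], [1.5, 2.0], [1.0, 0.5], [8.0, 8.0], [9.0, 9.0], [8.5, 9.5]]
labels, centers = kmeans(points, centers=[[0.0, 0.0], [10.0, 10.0]])
print(labels)
```

Real implementations typically choose starting centers randomly or with a seeding heuristic, so results can vary between runs; the fixed starting centers here keep the sketch deterministic.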

### 3. Model Selection: ...

3.1. Score, and cross-validated scores. As we have seen, every estimator exposes a score method that can judge the quality of the fit (or the prediction) on new data. Bigger is better.

    >>> from scikits.learn import datasets, svm
    >>> digits = datasets.load_digits()
    >>> X_digits = digits.data
    >>> y_digits = digits.target
    >>> svc = svm.SVC()
    >>> […]
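
To make the notion of a cross-validated score concrete, the following plain-Python sketch splits the data into folds, fits on all folds but one, and scores on the held-out fold. It is an illustration under assumptions, not scikit-learn code: `k_fold_indices` and `MajorityClassifier` are hypothetical helpers written for this example.

```python
def k_fold_indices(n_samples, n_folds):
    """Return (train, test) index lists for contiguous, near-equal folds."""
    sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
             for i in range(n_folds)]
    folds, start = [], 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds


class MajorityClassifier:
    """Toy estimator (hypothetical): always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label_ = max(sorted(set(y)), key=y.count)
        return self

    def score(self, X, y):
        # Accuracy: fraction of held-out labels equal to the predicted label.
        return sum(1 for target in y if target == self.label_) / len(y)


X = [[i] for i in range(6)]
y = [0, 0, 0, 1, 0, 0]

scores = []
for train, test in k_fold_indices(len(X), n_folds=3):
    estimator = MajorityClassifier().fit([X[i] for i in train], [y[i] for i in train])
    scores.append(estimator.score([X[i] for i in test], [y[i] for i in test]))
print(scores)
```

Each score is computed on data the estimator did not see during fitting, which is why the spread across folds gives a more honest picture than a single score on the training set.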

### 2. Supervised Learni...

The problem solved in supervised learning: supervised learning consists in learning the link between two datasets: the observed data X, and an external variable y that we are trying to predict, usually called target or labels. Most often, y is a 1D array of length n_samples. All supervised estimators in scikit-learn implement a fit(X, y) method to fit […]

### 1. Statistical Learn...

1.1. Datasets. Scikit-learn deals with learning information from one or more datasets that are represented as 2D arrays. They can be understood as a list of multi-dimensional observations. We say that the first axis of these arrays is the samples axis, while the second is the features axis. A simple example shipped with the […]

### Workflows in Python ...

The last two posts in this series have been about getting a data science analysis quickly up and running, and then circling back to improve it or understand the patterns I find, for example, which algorithms are working best and why. The upshot was a better handle on my workflow, but I’m left with a […]

### Workflows in Python ...

This is the second post in a series about end-to-end data analysis in Python using scikit-learn Pipeline and GridSearchCV. In the first post, I got my data formatted for machine learning by encoding string features as integers, and then used the data to build several different models. I got things running really fast, which is […]

### Workflows in Python ...

So, I had the opportunity to host a workshop at the Open Data Science Conference in San Francisco. During the workshop, I shared the process of rapid prototyping followed by iterating on the model I’ve built. When I’m building a machine learning model in scikit-learn, I usually don’t know exactly what my final model will […]