Ticket prices for ODSC East increasing at 11 PM Friday.




Use code: ODSC20 for extra 20% Off

Riding on Large Data with Scikit-learn

Tags: , , ,

What’s a Large Data Set?

A data set is said to be large when it exceeds 20% of the available RAM for a single machine. Which for your standard MacBook Pro with 8Gb of RAM, corresponds to a meager 2Gb dataset — size that is becoming more and more frequent these days.

Of course before you actually run out of memory, your machine will slow down to a crawl and your frustration will increase in inverse proportions.

To deal with large data, first aid kit strategies consists in sampling the data to only consider a subset of the whole data or reaching out for more RAM by going to the cloud. Amazon offers boxes with plenty of RAM for pennies/hour. Other options are to use libraries such as Apache Spark’s MLlib, or platforms such as H2O or Dato’s GraphLab Create. R also has a streaming package.

However, if scikit-learn is your weapon of choice for machine learning, you should stick with it and make the best of its out-of-core processing capabilities.

Out-of-core with scikit-learn?

Out-of-core processing is simply the method by which the original large data is broken down into smaller sized blocks and the processing is done iteratively on the small blocks. The whole dataset no longer needs to be loaded into memory and the model is trained incrementally on each block.

As you most likely know, scikit-learn is a very popular and very well documented machine learning library of algorithms which is often the subject of first-rate tutorials, books and conference workshops. Scikit-learn is steadily evolving with new models, efficiency improvements on speed and memory, and large data capabilities.

Although scikit-learn is optimized for smaller data, it does offer a decent set of algorithms for out-of-core classification, regression, clustering and decomposition. For classification, scikit-learn implements Multinomial and Bernoulli Naive Bayes, and the following linear models: Perceptron, Stochastic Gradient and the Passive Aggressive classifier.

Improving at scale data capabilities is a long term goal for scikit-learn’s evolution. According to Olivier Grisel, member of the scikit-learn core team:

We are trying to make sure that more and more scikit-learn algorithms can manage data in streaming mode, or out-of-core, rather than by loading the whole dataset in memory. We want them to load the dataset incrementally, as they train the model.

Scikit-learn implements out-of-core learning for these algorithms by making available a partial fit method as a common model API replacing the usual fit method.

Although out-of-core Random Forests are not implemented in scikit-learn, an online version called Mandrian Forests is available in python. Another popular online algorithm is Follow the Regularized Leader out of Google labs with this python implementation.

Complications and Advantages

No longer having the whole training dataset available requires using alternative methods for standard data preparation and feature extraction.

For instance, in the context of text classification with a bag of words approach, it is no longer possible to calculate the TfIdf weights (i.e. counting words frequency with respect to their distribution in the corpus of documents) as this would require a priori knowledge of all the words in the corpus.

Hashing Trick

Instead, the idea is to project the documents unto a reduced dimension vector space. For each new block of documents, each text is vectorized and projected on the same reduced vector space. The a priori knowledge of all the words in the corpus is therefore no longer needed.

The reduced space dimension is set large enough to compensate for the collision effect which occurs when 2 distinct words are projected on the same point in space. This method is known as the Hashing Trick (see this blog post for a great explanation) and is implemented in scikit learn via the HashingVectorizer method.

Out-of-core training also implies that the algorithm possesses a second layer of adaptivity that must be taken into account and tuned.

Not only does the model need to be efficiently trained on each block of new data, it now also must converge on the stream of data blocks. The size of each block becomes a new meta parameter that impacts the performance of the algorithm.

A fast test-assess loop

The good news is that you now have a fast trial and error environment. By putting aside a validation set of the data and using it to score the prediction performance of your model on each new block it is now possible to observe your model performance at each iteration.

You do not have to wait for a large chunk of data to be processed before you can assess the quality of the model. This very effectively speeds up your whole workflow: feature engineering, feature selection, meta parameter optimization, and model selection.

Case Study

To illustrate the out-of-core capabilities of scikit-learn, we consider Kaggle’s Truly Native competition which consisted in predicting whether an html page included ‘native ads‘ or not. Native ads are embedded in the page’s html code as opposed to being loaded from a third party website.

This dataset is large with over 30,000 pages for a total of 36Gb (7.5Gb zipped). The competition was sponsored by Dato to promote the use of the GraphLab create platform which is well suited to handle that type of large dataset. This competition has now ended.

The text-based nature of the dataset and its large size make it a perfect candidate for out-of-core text classification using the the hashing trick and scikit-learn’s out-of-core classification models.

The dataset is strongly unbalanced with only 10% of pages having native ads. To avoid dealing with unbalanced classification sets which in itself requires specific attention, we subsampled the no-ads set to obtain a 50-50 split between pages with and without native ads both for the validation and the training sets. We considered 50,000 pages for the training set and 4,000 for the test set.

Memory usage

The following graph shows the evolution in consumed memory on a MacBook Pro with 8Gb of RAM.
Loading 10k pages and training a standard TfIdf + SGD pipeline had the memory consumption reach over the top.

On the other hand, processing chunks of 100, 500 or 1000 files through a out-of-core flow kept the consumed data well within the available RAM. Note that in this very particular setup the in-memory flow took less time than the out-of-core one. The overhead of the partial_fit method is visible, although probably not the only factor at play.

Fig1 - Memory consumption for In Memory vs out-of-core processing

Model selection

We also compared Multinomial Naive Bayes and linear classifiers: SGD, PassiveAggressive and the Perceptron.

For each iteration we calculated Accuracy and Recall of the model’s prediction with a sample batch size of 500 pages over 100 iterations. Accuracy is shown in the figure below.

The difference in convergence behavior between the Multinomial Naive Bayes classifier and the linear classifiers is striking.

The linear models all show some level of ‘agitation’ even after convergence has been reached. In fact, the curves for the linear models even had to be smoothed to keep the graph readable. On the other hand the Naive Bayes classifier shows a slow but steady and stable convergence. The agitation is even more pronounced for Recall (not shown here).

It’s as if there was a significant portion of the data that was bounced back from being predicted positive to negative by the linear models at nearly each iteration.

Fig2 - Model comparison In Memory vs out-of-core processing

Note: The performance shown here in terms of Accuracy is far from the best scores attained in the Kaggle competition (98-99% AUC). However we only performed some basic feature extraction and used the algorithms default settings for the most part. The case being here to show the feasibility of the out-of-core approach and some sort of convergence and not to win the competition.

Block size

Finally we studied the impact of the block size on the Multinomial Naive Bayes algorithm. The figure below shows the Accuracy (left) and Recall (Right) for batch sizes of 10, 100 and 500 items.
The following can be observed:

  • Small sample size create instability.
  • Recall and Accuracy have very different behavior.
  • Accuracy converges more rapidly and with better performance for higher batch size.
  • Recall, although unstable for a small batch size, shows better performance for a small batch size than for a large one.

Fig3 - Batch size influence on Accuracy and Recall
Note that the total number of samples was limited to 50k which limits the number of iterations to 100 for a 500 samples batch size.

The new batch size parameter does have a strong impact on the performance of the algorithm.


Even with out-of-core enabled models, handling large data on a single machine may still be testing your patience.

Since data munging and feature engineering make up the biggest part of the work, you need to design a workflow that will enable you to try out new ideas fast. Reading the data, scaling it, extracting new features will take time and speeding things up requires a solid mastery of your language of your choice.

Dask, a parallel computing framework that streams data from disk, is currently thought of as a potential way to complement scikit-learn and bring its full out-of-core capabilities. There are discussions in the dask and scikit forums on how to combine the two libraries. This approach holds many promises.

However, adding out-of-core capabilities to pipelines, grid search, and other algorithms will require rewriting some of scikit-learn’s basic blocks, which will surely take time.

In the end what counts is that current out-of-core capabilities of scikit-learn do give you the possibility to train a whole set of standard models as fast as if you were dealing with a small dataset. That’s pretty exciting!

Further Reading and Viewing

You can read more from Alex Perrier on his blog.