Up to 95% of a data scientist’s time is spent on data wrangling. Meanwhile, roughly 99% of data scientists hate data wrangling. That’s problematic.
Data wrangling tends to be the most redundant and mind-numbing process associated with building Machine Learning (ML) models. There are four steps to building an ML model: data processing, data cleaning, feature engineering, and finally, model selection. Yet many ML automation platforms ignore the first three steps (data wrangling) entirely and focus on model selection. Model selection typically involves iterating through hundreds of models to find the model and set of hyper-parameters which optimally capture the distribution of the data. Unfortunately, once you have a nicely formatted, cleaned, and featurized dataset, model selection is a rather academic task.
So why not build solutions to help automate data wrangling for the data scientist? There’s a good reason. It’s really hard.
In this blog post, we begin to tackle this big enchilada of automating data wrangling. We will start by drawing inspiration from one step of the data wrangling process: Feature Engineering.
Feature Detection in Computer Vision
During my PhD, I worked on building ML models to learn representations of object categories. Data wrangling, and feature engineering in particular, played an outsized role in all of my work and was the deciding factor in achieving the strong model performance needed to get into top-tier computer vision conferences (and, ultimately, to graduate). Feature engineering in object recognition takes the form of detecting interesting parts of an image based on complex changes in gradients (see Figure 1). A “feature detector” finds these interesting parts of an image. Unfortunately, neither I nor anyone else working on object recognition at the time was ever sure which feature detector to use or at what resolution(s) to apply it. It was a black art. What did we end up doing? Iterating through dozens of feature detectors on our images and picking the one which worked best for a particular problem. Brute force.
Figure 1: Example of SIFT feature detection on a common image. SIFT was one of the hot feature detectors during my PhD. I would literally run dozens of feature detectors over images to find the one that gave the best performance.
The rote and iterative task of trying different feature detectors gave us an idea as we approached the problem of feature engineering in a broader context for ML problems at Vidora. Could we build a framework to automate these processes and free data scientists from some of the day-to-day drudgery associated with data wrangling?
The Machine Learning Pipeline
Figure 2: The Machine Learning pipeline automates data wrangling and enables one to build models from raw data.
The ML pipeline is the central framework of Vidora’s solution. It is divided into the four steps introduced earlier, and each of these steps comprises a large number of separate modules, each performing a different transformation that can be applied to a particular ML problem.
Let’s consider feature engineering again. Possible feature engineering modules for ML problems focused on predicting customer behavior include:
- Summing customer events over a specified time period
- Looking at changes in customer activity
- Sequencing customer events
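To make the bullets above concrete, here is a minimal pure-Python sketch of what such feature engineering modules might compute over a customer’s event stream. The event format and function names are illustrative assumptions, not Vidora’s actual modules:

```python
from datetime import datetime, timedelta

# Each event is a (timestamp, event_type) pair for a single customer.

def sum_events(events, start, end):
    """Count a customer's events inside a time window [start, end)."""
    return sum(1 for ts, _ in events if start <= ts < end)

def activity_change(events, split, window):
    """Difference in event counts between two adjacent windows of equal length."""
    before = sum_events(events, split - window, split)
    after = sum_events(events, split, split + window)
    return after - before

def event_sequence(events):
    """Order a customer's event types chronologically."""
    return [etype for _, etype in sorted(events, key=lambda e: e[0])]

events = [
    (datetime(2023, 1, 1), "view"),
    (datetime(2023, 1, 3), "click"),
    (datetime(2023, 1, 8), "purchase"),
]
split = datetime(2023, 1, 5)
week = timedelta(days=7)

print(sum_events(events, split - week, split))  # events in the prior week -> 2
print(activity_change(events, split, week))     # recent minus prior activity -> -1
print(event_sequence(events))                   # ['view', 'click', 'purchase']
```

Each function maps raw events to a candidate feature; in practice these would be computed per customer across an entire dataset.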
There are effectively infinitely many possible feature engineering techniques. Vidora maintains a repository describing various feature engineering techniques for ML problems and their relative efficacy. Data scientists tend to have strong intuitions about which technique to use for which problem. What the ML pipeline does is automate the task of selecting which feature engineering techniques to use through a combination of (a) cleverly searching through the space of feature engineering techniques – much as I searched through the space of feature detectors during my PhD – and (b) learning which feature engineering techniques have worked well in the past on similar datasets. The latter is a form of meta-learning, an increasingly active area of research within the ML community.
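The search half of this idea, in its simplest brute-force form, can be sketched in a few lines: try each candidate featurizer, score its output with a validation metric, and keep the best. The featurizers and the scoring function below are stand-ins, not the pipeline’s actual search strategy:

```python
def select_featurizer(featurizers, raw_data, score):
    """Return the (name, score) of the featurizer whose output scores best."""
    best_name, best_score = None, float("-inf")
    for name, featurize in featurizers.items():
        s = score(featurize(raw_data))
        if s > best_score:
            best_name, best_score = name, s
    return best_name, best_score

raw = [3, 1, 4, 1, 5]  # e.g., daily event counts for one customer

featurizers = {
    "identity": lambda xs: xs,
    "totals": lambda xs: [sum(xs)],
    "deltas": lambda xs: [b - a for a, b in zip(xs, xs[1:])],
}

# Stand-in validation metric; a real pipeline would score a model trained
# on the resulting features against held-out data.
score = len

print(select_featurizer(featurizers, raw, score))  # ('identity', 5)
```

The meta-learning half would replace the exhaustive loop with a prior over techniques, learned from which featurizers performed well on similar past datasets, so the search visits promising candidates first.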
During the upcoming talk at the ODSC West Conference in October, we will dive more deeply into the ML Pipeline and techniques for cleverly searching and learning which data wrangling techniques to use for any particular ML problem. All in a herculean effort to help solve the bane of data scientists everywhere: data wrangling.
Vidora enables anyone in any business to build end-to-end machine learning pipelines. With Vidora’s self-service platform, Cortex, machine learning is intuitive, interpretable and fast, automating the entire machine learning pipeline from raw data to model outputs. Developed by experts in machine learning and artificial intelligence from Stanford, Berkeley, and Caltech, Cortex sits at the heart of some of the largest global brands, such as Walmart, News Corp, and Discovery. Learn more at www.vidora.com