Scott is presenting at ODSC East 2019 this May in Boston! Check out his workshop “Real-ish Time Predictive Analytics with Spark Structured Streaming.”
With so many wonderful open-source tools and frameworks now available for machine learning and deep learning, it can be difficult to pin down exactly which part of the problem space belongs to predictive analytics. This is mainly because classification problems, statistics, and probability cut across many branches of the machine learning umbrella.
Therefore, it’s worth understanding what steps are critical when establishing a system capable of delivering real(ish) time predictive analytics, as well as mapping out the path to get there. Like many things, we start by establishing first principles.
Asking the Right Questions
Your data can either assist you or become a stubborn thorn in your side. This is why it is critical to ask the right questions up front, so that you first understand whether you can solve “any problems” with your current data or whether you must take a step back to define different data structures. Stepping back would entail collecting, extracting, and possibly joining data from new or alternative sources in order to gain a more focused understanding of the data and its innate behavior and statistical patterns. This work is required to form the basis of even the most primitive models.
Understand the Shape of your Data
Good data mining skills are critical for this early understanding of what is and isn’t possible with your data. This first step is usually referred to as exploratory data analysis. For the fields in your initial static dataset, it typically includes generating summary column statistics: for continuous numeric data types, the min, first quartile, second quartile (median), third quartile, and max, as well as the mean, standard deviation, and 99th percentile; for categorical types or labels, the lexicographical min and max.
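As a minimal sketch of the summary statistics listed above, the following uses only the Python standard library rather than Spark. The column names and sample values (`latency_ms`, `carrier`) are hypothetical and invented for illustration, and the "inclusive" quantile method is just one of several reasonable conventions.

```python
import statistics

def summarize_numeric(values):
    """Summary statistics for one continuous numeric column."""
    ordered = sorted(values)
    # quantiles(n=4) returns the three quartile cut points: Q1, median, Q3.
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    # quantiles(n=100) returns 99 cut points; the last is the 99th percentile.
    p99 = statistics.quantiles(ordered, n=100, method="inclusive")[-1]
    return {
        "min": ordered[0],
        "q1": q1,
        "median": q2,
        "q3": q3,
        "max": ordered[-1],
        "mean": statistics.mean(ordered),
        "stdev": statistics.stdev(ordered),  # sample standard deviation
        "p99": p99,
    }

def summarize_categorical(labels):
    """Lexicographical min and max, plus distinct count, for a label column."""
    return {"min": min(labels), "max": max(labels), "distinct": len(set(labels))}

# Hypothetical sample columns (assumptions, not from a real dataset).
latency_ms = [12.0, 15.0, 11.0, 240.0, 14.0, 13.0, 16.0, 18.0, 12.5, 17.0]
carrier = ["verizon", "att", "tmobile", "att", "sprint"]

numeric_summary = summarize_numeric(latency_ms)
categorical_summary = summarize_categorical(carrier)

for name, value in numeric_summary.items():
    print(f"{name}: {value}")
print(categorical_summary)
```

Comparing the mean against the median in a summary like this is a quick way to spot skew: a single large outlier pulls the mean far above the median.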
This is often a perfect time to explore how your data naturally falls into clusters. By applying any one of many clustering and pattern-mining techniques, such as hierarchical clustering or frequent itemset analysis (Apriori), you can view how your data naturally groups or organizes itself. You could test bucketing data by range, remove skewness in the data by applying log transformations, or calculate dispersion statistics like MAD to decide whether to normalize the data using something like z-score normalization. This exploration phase is also an ideal time to look at dimensionality reduction techniques like PCA or PLS, so you can understand which properties of your data speak louder than others.
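The transformations mentioned above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions: "MAD" is read here as the median absolute deviation, the bucket edges are arbitrary, and the `durations` sample is hypothetical.

```python
import bisect
import math
import statistics

def log_transform(values):
    """Reduce right skew with log(1 + x); assumes every value is >= 0."""
    return [math.log1p(v) for v in values]

def median_absolute_deviation(values):
    """Median of the absolute deviations from the median (robust spread)."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

def z_scores(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def bucket_by_range(values, edges):
    """Map each value to the index of the range it falls in.

    Bucket 0 holds values below edges[0]; bucket len(edges) holds values
    greater than or equal to the last edge.
    """
    return [bisect.bisect_right(edges, v) for v in values]

# Hypothetical right-skewed sample with one large outlier.
durations = [1, 2, 2, 3, 3, 3, 4, 50]

logged = log_transform(durations)
mad = median_absolute_deviation(durations)
normalized = z_scores(durations)
buckets = bucket_by_range(durations, edges=[2, 4, 10])

print("MAD:", mad)
print("stdev:", statistics.pstdev(durations))
print("buckets:", buckets)
```

Because the median absolute deviation is far less sensitive to the outlier than the standard deviation, comparing the two gives a rough signal of how much a few extreme values would distort a z-score normalization.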
The end result of this early data exploration should allow you to establish trust in the data. You should also gain a solid understanding of both the statistical properties of your data and the nuances that represent its underlying behavior or characteristics.
This is the base upon which you will start to build your predictive models, so cutting corners here will lead to devastating results down the line: defining the right problem on the wrong dataset will yield unreliable predictions.
Learning from the Past
Once you have established trust in your data, generated a good structure to encapsulate said data, and started collecting and storing that data (be it in HDFS, MySQL, or some other data store), you can begin the process of building a predictive model.
A predictive model, in its most fundamental form, is any model capable of assigning a tag, label, or value to a given record based on the probability that the record has historically belonged to a specific group.
Predictive Modeling: with Labels
If you fall into the category of lucky individuals who happen to have labeled data (e.g., data sets with explicit class labels to train on), then there are many ready-to-use tools that will allow you to generate predictive models using supervised training and fitting techniques.
The simplest is binary logistic regression, which groups results into one of two categories (true or false) while maintaining a fairly good level of accuracy for univariate predictions. More complicated predictive models such as Quantile Decision Trees (or Forests) can yield fast and accurate predictions even with sparse data and large continuous variables, while also optimizing and selecting the best underlying predictors for multiple classifiers at the training phase.
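To make the simplest case concrete, here is a minimal sketch of binary logistic regression on a single feature, trained with plain batch gradient descent in standard-library Python. It is an illustration, not a production approach: in practice you would likely reach for a library such as Spark MLlib. The feature values are hypothetical and assumed to be already z-score normalized, and the learning rate and epoch count are arbitrary choices.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit weight w and bias b by batch gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y  # predicted probability minus label
            grad_w += error * x
            grad_b += error
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b

def predict_proba(w, b, x):
    """Estimated probability that x belongs to the positive class."""
    return sigmoid(w * x + b)

def predict(w, b, x, threshold=0.5):
    """True if the estimated probability meets the decision threshold."""
    return predict_proba(w, b, x) >= threshold

# Hypothetical labeled data: one normalized feature, binary label (1 = true).
features = [-1.75, -1.25, -0.75, -0.25, 0.25, 0.75, 1.25, 1.75]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = fit_logistic(features, labels)
correct = sum(predict(w, b, x) == bool(y) for x, y in zip(features, labels))
accuracy = correct / len(features)

print("weight:", w, "bias:", b)
print("training accuracy:", accuracy)
print("P(true | x=1.0):", predict_proba(w, b, 1.0))
```

The model outputs a probability, and the true/false grouping comes from the threshold applied to it, so the threshold is a tunable decision rather than part of the fit.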
Accounting for Zero or Partially Labeled Data
Regardless of the framework used, most predictive modeling expects labeled data. If you happen to fall into the category of folks who have good but unlabeled data, then there is still hope!
This hope comes in the form of auto-labeling. Auto-labeling is a solution that combines domain knowledge with statistics (think back to the data mining work described above) to derive a set of thresholds or rules with which to label data automatically, either in real time or in batch.
This auto-labeled data can then be fed into semi-supervised learning models that can still predict results without the need to manually label your data or pay for a team to do it for you.
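The auto-labeling idea described above can be sketched as a threshold rule derived from historical statistics. Everything specific here is an assumption made for illustration: the field names (`jitter_ms`, `packet_loss_pct`), the label names, the sample history, and the "mean plus k standard deviations" rule, where k stands in for a choice informed by domain knowledge.

```python
import statistics

def derive_threshold(history, k=2.0):
    """Threshold = mean + k * population stdev of the historical values."""
    return statistics.mean(history) + k * statistics.pstdev(history)

def auto_label(record, thresholds):
    """Label a record 'degraded' if any metric exceeds its threshold."""
    for field, limit in thresholds.items():
        if record[field] > limit:
            return "degraded"
    return "healthy"

# Hypothetical historical observations gathered during data exploration.
history = {
    "jitter_ms": [10, 12, 11, 13, 12, 11, 10, 12],
    "packet_loss_pct": [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.2, 0.2],
}
thresholds = {field: derive_threshold(values) for field, values in history.items()}

# Hypothetical unlabeled records, labeled here as a small batch.
unlabeled = [
    {"jitter_ms": 11.0, "packet_loss_pct": 0.2},
    {"jitter_ms": 25.0, "packet_loss_pct": 0.1},
    {"jitter_ms": 12.0, "packet_loss_pct": 1.5},
]
auto_labels = [auto_label(record, thresholds) for record in unlabeled]

for record, label in zip(unlabeled, auto_labels):
    print(record, "->", label)
```

The same `auto_label` function could be applied per record as data arrives or over a stored batch, since it depends only on the record and the precomputed thresholds.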
In my workshop, “Real-ish Time Predictive Analytics with Spark Structured Streaming” in May at ODSC East 2019 in Boston, I will detail how you can apply both supervised and semi-supervised techniques to generate predictive models while also touching upon how to extend this to time-series predictions using Apache Spark Structured Streaming.
About the Author
Principal Software Engineer, Twilio
Twitter Handle: @newfront
Blog: medium @newfrontcreative
Scott Haines is a Principal Software Engineer on the Voice Insights team at Twilio. His focus has been on the architecture and development of a real-time (sub-250 ms), highly available, trustworthy analytics system. His team provides near-real-time analytics that process, aggregate, and analyze multiple terabytes of global sensor data daily. Scott helped drive Apache Spark adoption at Twilio and actively teaches and consults with teams internally. Previously, at Yahoo!, he built a real-time recommendation engine and targeted ranking/ratings analytics that helped serve personalized page content for millions of Yahoo Games customers. He also built a real-time click/install tracking system that helped deliver customized push marketing and ad attribution for Yahoo Sports. Scott finished his tenure at Yahoo working for Flurry Analytics, where he wrote an auto-regressive smart alerting and notification system that integrated into the Flurry mobile app for iOS and Android.