When data scientists train models, they build features specifically for the development environment. Data engineers then have to rewrite those features to make them production-ready. After many years in the machine learning field, seeing this siloed process occur over and over, I’ve spent this past year advocating for a better solution: one that shortens development cycles, supports real-time feature engineering and reduces the risk of training-serving skew, where code changes between training and production leave a deployed model inaccurate. That solution is the feature store.
The Year of the Feature Store
2021 has been dubbed the “Year of the Feature Store” by many members of the ML community. An emerging technology, feature stores have transformed operational ML pipelines (MLOps) and provided a solution for bringing AI projects to production in a powerful, standardized, and methodical manner, by making feature creation and feature sharing easier.
The feature store is commonly described as a centralized repository to store, share and manage features. But the more significant function some feature stores perform is as a data transformation service that can compute fresh values on streaming data, such as sliding window aggregations.
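To make the idea of a sliding window aggregation concrete, here is a minimal sketch in plain Python. It is an illustration only, not how any particular feature store implements it: the class name SlidingWindowSum, its methods and the 60-second window are all hypothetical, and it assumes events arrive in timestamp order.

```python
from collections import deque


class SlidingWindowSum:
    """Running sum and count of the event values seen in the last `window_seconds`.

    Hypothetical illustration; assumes events arrive in non-decreasing
    timestamp order.
    """

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def add(self, timestamp, value):
        # Record the new event, then drop everything that is now too old.
        self.events.append((timestamp, value))
        self.total += value
        self._evict(timestamp)

    def _evict(self, now):
        # Keep only events inside the half-open window (now - window, now].
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_value = self.events.popleft()
            self.total -= old_value

    def features(self):
        # The aggregated values a model would consume as features.
        return {"sum": self.total, "count": len(self.events)}


window = SlidingWindowSum(window_seconds=60)
window.add(0, 10.0)
window.add(30, 5.0)
window.add(70, 2.0)  # the event at t=0 has now left the 60-second window
print(window.features())
```

Each new event updates the aggregate incrementally rather than rescanning the full history, which is what keeps this kind of calculation cheap enough to run on a stream.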
A feature is an input variable to a machine learning model: a piece of data that describes some phenomenon. There are two types of features: offline and online. Offline features do not change often, are processed in batches and are typically calculated with Spark or SQL.
Online features, on the other hand, are dynamic and require a processing engine to keep them up to date. Sometimes, these calculations need to take place in real time. Data for these features is stored in memory or in a very fast key-value database. The processing itself can be performed on various services in the cloud or on a dedicated MLOps platform.
Feature stores run scalable, high-performance data pipelines to transform raw data into features. ML teams can define a feature once and deploy it to production without rewriting it.
This significant reduction of effort is particularly important in use cases where real-time feature engineering is required. To understand why these capabilities matter, consider the many use cases that have been affected by the radical shifts in behavior surrounding the COVID-19 pandemic. Whether a model deals with consumer behavior, market trends, demand forecasts or fraud prediction, it’s now especially critical that business applications can adapt to fresh data. Tracking models and maintaining their accuracy has always been a core step of any MLOps strategy, but the upheaval of the past couple of years has put additional pressure on organizations to keep up through online feature engineering.
The advantages of the feature store include reducing duplicate work, saving time, keeping features accurate, preventing drift, maintaining a single source of truth and supporting security and compliance activities.
Do You Need a Feature Store? It Depends.
Complex use cases that require deploying and managing multiple models in production will benefit from a feature store, especially when real-time data is involved. Real-time ML pipelines need very fast event processing to calculate features as events arrive. For example, AI recommendation engines and fraud prevention applications require response times in the range of milliseconds.
To support such low-latency event processing, data scientists and engineers need the right set of tools, and the Spark computing model often does not support this kind of processing well. A feature store addresses these problems by using the same logic for training and serving features. Calculation time is significantly reduced, which is critical in real-time use cases.
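As a rough sketch of what "the same logic for training and serving" can look like, the Python example below defines a feature once and calls it from both a batch path and a request path. Every name here (transaction_features, build_training_rows, serve_features and the feature names) is hypothetical and chosen for illustration; a real feature store would wrap this idea in its own API.

```python
def transaction_features(amounts):
    """Hypothetical feature definition: statistics over recent transaction amounts."""
    count = len(amounts)
    total = sum(amounts)
    return {
        "txn_count": count,
        "txn_total": total,
        # Guard against division by zero for customers with no transactions.
        "txn_mean": total / count if count else 0.0,
    }


def build_training_rows(history_by_customer):
    # Offline path: apply the feature definition to every customer in a batch.
    return {
        customer_id: transaction_features(amounts)
        for customer_id, amounts in history_by_customer.items()
    }


def serve_features(recent_amounts):
    # Online path: apply the very same definition to a single incoming request.
    return transaction_features(recent_amounts)


history = {"customer_a": [10.0, 20.0], "customer_b": []}
training_rows = build_training_rows(history)
online_row = serve_features([10.0, 20.0])
print(training_rows["customer_a"] == online_row)
```

Because both paths call one function, a change to the feature definition reaches training and serving together, which is the property that guards against training-serving skew.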
The One Characteristic that Will Make or Break Your Feature Store
Several new feature stores have become publicly available in the past year, and many more are expected to follow. When deciding which of these feature stores to implement, be sure to validate that it can integrate with the other components in your MLOps stack. Using an integrated feature store will make life simpler for everyone on your team, with monitoring, pipeline automation and multiple deployment options already available, without the need for lots of glue logic and maintenance.
About the author:
Adi Hirschtein has more than twenty years of experience as an executive, product manager, and entrepreneur building and driving innovation in technology companies primarily focusing on big data, databases, and storage technologies. During his career Adi held various management roles in both startups and large corporations, driving products from inception to wide market adoption.