Streaming Machine Learning Without a Data Lake
Machine LearningModelingEurope 2023posted by ODSC Community May 22, 2023 ODSC Community
Editor’s note: Kai Waehner is a speaker for ODSC Europe this June. Be sure to check out his talk, “Apache Kafka for Real-Time Machine Learning Without a Data Lake,” there!
The combination of data streaming and machine learning (ML) enables you to build one scalable, reliable, but also simple infrastructure for all machine learning tasks using the Apache Kafka ecosystem. This blog post features a predictive maintenance use case within a connected car infrastructure, but the discussed components and architecture are helpful in any industry.
The Apache Kafka ecosystem is used more and more to build scalable and reliable machine learning infrastructure for data ingestion, preprocessing, model training, real-time predictions, and monitoring. I had previously discussed example use cases and architectures that leverage Apache Kafka and machine learning.
Here’s a recap of what this looks like:
The old way: Kafka as an ingestion layer into a data lake
A data lake is a system or repository of data stored in its natural/raw format—usually object blobs or files. It is typically a single store of all enterprise data, including raw copies of source system data and transformed data used for tasks such as reporting, visualization, advanced analytics, and machine learning. Commonly used technologies for data storage are the Hadoop Distributed File System (HDFS), Amazon S3, Google Cloud Storage (GCS), or Azure Blob Storage, as well as tools like Apache Hive, Apache Spark, and TensorFlow for data processing and analytics. Data processing happens in batch mode with the data stored at rest and can take minutes or even hours.
Apache Kafka is an event streaming platform that collects, stores, and processes streams of data (events) in real-time and in an elastic, scalable, and fault-tolerant manner. The Kafka broker stores the data immutably in a distributed, highly available infrastructure. Consumers read the events and process the data in real-time.
A very common pattern for building machine learning infrastructure is to ingest data via Kafka into a data lake.
From there, a machine learning framework like TensorFlow, H2O, or Spark MLlib uses the historical data to train analytic models with algorithms like decision trees, clustering, or neural networks. The analytic model is then deployed into a model server or any other application for predictions on new events in batch or in real-time.
All processing and machine-learning-related tasks are implemented in the analytics platform. While the ingest happens in (near) real-time via Kafka, all other processing is typically done in batch. The problem with a data lake as a central storage system is its batch nature. If the core system is batch, you cannot add real-time processing on top of it. This means you lose most of the benefits of Kafka’s immutable log and offsets and instead, now end up having to manage two different systems with different access patterns.
Another drawback of this traditional approach is using a data lake just for the sake of storing the data. This adds additional costs and operational efforts for the overall architecture. You should always ask yourself: do I need an additional data lake if I have the data in Kafka already? What are the advantages and use cases? Do I need a central data lake for all business units, or does just one business unit need a data lake? If so, is it for all or just some of the data?
Unsurprisingly, more and more enterprises are moving away from one single, central data lake to use the right data store for their needs and business units. Yes, many people still need a data lake (for their relevant data, not all enterprise data). But others actually need something different: a text search, a time series database, or a real-time consumer to process the data with their business application.
The new way: Kafka for streaming machine learning without a data lake
Let’s take a look at a new approach for model training and predictions that do not require a data lake. Instead, streaming machine learning is used: direct consumption of data streams from Kafka into the machine learning framework.
This example features the TensorFlow I/O and its Kafka plugin. The TensorFlow instance acts as a Kafka consumer to load new events into its memory. Consumption can happen in different ways:
- In real-time directly from the page cache: not from disks attached to the broker
- Retroactively from the disks: this could be either all data in a Kafka topic, a specific time span, or specific partitions
- Falling behind: even if the goal might always be real-time consumption, the consumer might fall behind and need to consume “old data” from the disks. Kafka handles the backpressure
Most machine learning algorithms don’t support online model training today, but there are some exceptions like unsupervised online clustering. Therefore, the TensorFlow application typically takes a batch of the consumed events at once to train an analytic model.
With streaming machine learning, you can directly use streaming data for model training and predictions either in the same application or separately in different applications. Separation of concerns is a best practice and allows you to choose the right technologies for each task. In the following example, we use Python, the beloved programming language of the data scientist, for model training, and a robust and scalable Java application for real-time model predictions.
The whole pipeline is built on an event streaming platform in independent microservices. This includes data integration, preprocessing, model training, real-time predictions, and monitoring:
The above example uses the Kappa architecture replacing Lambda in more and more enterprises.
Streaming machine learning simplifies machine learning infrastructure
A data streaming platform is the core foundation of a cutting-edge machine learning infrastructure. Streaming machine learning—where the machine learning tools directly consume the data from the immutable Kafka log—simplifies your overall architecture significantly. This means:
- You don’t need another data lake
- You can leverage one pipeline for model training and predictions
- You can provide a complete real-time monitoring infrastructure
- You can enable access through traditional BI and analytics tools
The described streaming architecture is built on top of the data streaming platform Apache Kafka. The heart of its architecture leverages the event-based Kappa design. This enables patterns like event sourcing and CQRS, as well as real-time processing and the usage of communication paradigms and processing patterns like near real-time, batch, or request-response. Tiered Storage enables long-term storage with low cost and the ability to more easily operate large Kafka clusters.
This streaming machine learning infrastructure establishes a reliable, scalable, and future-ready infrastructure using frontline technologies, while still providing connectivity to any legacy technology or communication paradigm.
If you want to learn more about this innovative architecture, and why this still complements (not replaces!) a data lake, join my ODSC talk live or on demand.
About the author/ODSC Europe speaker:
Kai Waehner is Field CTO at Confluent. He works with customers and partners across the globe and with internal teams like engineering and marketing. Kai’s main area of expertise lies within the fields of Data Streaming, Analytics, Hybrid Cloud Architectures, and the Internet of Things. Kai is a regular speaker at international conferences, writes articles for professional journals, and shares his experiences with industry use cases and new technologies on his blog: www.kai-waehner.de. Contact: email@example.com / Twitter / LinkedIn.