Stream Data Processing with Apache Kafka and TensorFlow

Editor’s note: Yong is a speaker for the upcoming ODSC East 2019 this April 30 – May 3! Be sure to check out his talk, “Deep Learning for Real Time Streaming Data with Kafka and TensorFlow.”

As one of the most popular deep learning frameworks, TensorFlow has been widely adopted in production across a broad spectrum of industries. The recently announced TensorFlow 2.0 will be released early this year with many changes. The high-level tf.keras API, eager execution, and tf.data greatly simplify its usage. Here is a good example of model training in just a few lines:


import tensorflow as tf

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(512, activation=tf.nn.relu),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)



[Related article: Top-Rated Development Tools for Machine Learning]

However, tf.data, TensorFlow’s data input processing component, covers only a small set of file formats. Users from different industries often encounter challenges integrating TensorFlow with data sources not commonly seen in the machine learning community.
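To illustrate what tf.data covers out of the box, here is a minimal sketch of its built-in sources: in-memory tensors, text files, and TensorFlow’s own TFRecord format. The file name used below is an illustrative assumption, not from the article.

```python
import os
import tempfile
import tensorflow as tf

# In-memory data as a dataset
ds = tf.data.Dataset.from_tensor_slices([1, 2, 3])

# Text files, one element per line (using a throwaway file for illustration)
path = os.path.join(tempfile.mkdtemp(), "data.txt")
with open(path, "w") as f:
    f.write("a\nb\n")
lines = tf.data.TextLineDataset(path)

# TFRecord files, TensorFlow's own serialization format, are read similarly:
#   tf.data.TFRecordDataset("train.tfrecord")
```

Everything beyond these built-in sources, such as a Kafka topic, has historically required custom glue code.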

One example is the integration of TensorFlow with Apache Kafka. Kafka is widely used for stream processing and is supported by most big data frameworks, such as Spark and Flink. For a long time, though, there was no Kafka streaming support in TensorFlow. TensorFlow’s data formats, such as TFRecord and tf.Example, are likewise rarely seen in the big data and data science communities.

Many users are forced to consolidate these two frameworks in a very awkward way: set up additional infrastructure, read messages from Kafka, convert the messages into TFRecord format, have TensorFlow read the TFRecords from a file system, run training or inference, and save the models or results back to the file system. This process is error-prone and hard to maintain from an infrastructure perspective.

With help from the community, Apache Kafka streaming support for TensorFlow was recently released as part of the tensorflow-io package (https://github.com/tensorflow/io) by TensorFlow’s SIG IO. SIG IO is a special interest group under the TensorFlow organization that focuses on I/O, streaming, and file format support. In addition to Apache Kafka streaming, tensorflow-io supports a very broad range of data formats and frameworks: Apache Ignite for memory and caching, Apache Parquet and Arrow for serialization, AWS Kinesis and Google Cloud Pub/Sub for streaming, and many video, audio, and image file formats. Both Python and R can be used, which is especially convenient for the data science community.

It is worth mentioning that tensorflow-io is implemented as part of the tf.data pipeline and as a natural extension of the TensorFlow 2.0 API. In other words, users can read data from Kafka and pass it to tf.keras, with training or inference remaining exactly as simple and succinct as the example shown at the beginning of this article.
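As a sketch of how this looks in practice, a Kafka-backed dataset from tensorflow-io can be mapped, batched, and handed to model.fit like any other tf.data pipeline. The topic name, broker address, and CSV message layout below are illustrative assumptions, not from the article, and running the Kafka portion requires tensorflow-io and a live broker.

```python
import tensorflow as tf

def decode_csv_message(message):
    # Illustrative assumption: each Kafka message is a CSV line of
    # label,pixel0,...,pixel783 (785 float columns for an MNIST-style record).
    record = tf.io.decode_csv(message, record_defaults=[[0.0]] * 785)
    label = tf.cast(record[0], tf.int32)
    features = tf.stack(record[1:]) / 255.0  # normalize as in the example above
    return features, label

def kafka_training_dataset(topic="mnist-train", servers="localhost:9092"):
    # Requires the tensorflow-io package and a running Kafka broker.
    import tensorflow_io.kafka as kafka_io
    dataset = kafka_io.KafkaDataset([topic], servers=servers, group="demo", eof=True)
    return dataset.map(decode_csv_message).batch(32)

# With a broker available, training plugs in directly:
#   model.fit(kafka_training_dataset(), epochs=5)
```

The point of the design is that the Kafka source is just another tf.data.Dataset, so the model code from the beginning of the article does not change at all.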

[Related article: Data Ops: Running ML Models in Production the Right Way]

We hope this article helps a broader community adopt TensorFlow. We will also provide more details in the upcoming ODSC East 2019 session, “Deep Learning for Real-Time Streaming Data with Kafka and TensorFlow.”

Yong Tang

Yong Tang is Director of Engineering at MobileIron. His most recent focus is on data processing in machine learning. He is a maintainer and the SIG I/O lead of the TensorFlow project. He received the Open Source Peer Bonus Award from Google for his contributions to TensorFlow and is the author of the Kafka Dataset module in TensorFlow. In addition to TensorFlow, Yong Tang contributes to many other open source projects: he is a maintainer of Docker, CoreDNS, and SwarmKit. He received his PhD in Computer Science & Engineering from the University of Florida.