When Data Science Comes Together with Stream Processing: Apache Flink When Data Science Comes Together with Stream Processing: Apache Flink
Practically all business-related data is produced as an infinite stream of events: sensor measurements, website engagements, geo-location data from industrial IoT... When Data Science Comes Together with Stream Processing: Apache Flink

Practically all business-related data is produced as an infinite stream of events: sensor measurements, website engagements, geo-location data from industrial IoT devices, database modifications, stock trades, and financial transactions, to name a few. Successful data science organizations are those that can go beyond just discovering valuable insights once but can do so continuously and in real-time. Stream processing is a rapidly-growing data processing paradigm for real-time analytics and event-driven applications. Apache Flink is a distributed data processing framework for stateful computations over unbounded and bounded data streams. 

[Related Article: Emerging Need of Cloud Native Stream Processing]

What’s changing faster, the query or the data?

All data tasks begin with some sort of exploration of a data set; cleaning, combining and analyzing data sets to discover interesting and valuable insights. This analysis occurs over data parked somewhere, like a database or data lake, where data scientists try to understand it by probing the data with a series of ad-hoc queries, building models, etc. 

But what happens when this analysis is complete? The results of such computations and data exploration tasks—like machine learning pipelines or reporting queries—often become part of streaming applications when they are applied in a production environment. In this way, stream processing is data science in production.

Apache Flink comes with a unified approach to data processing combining both unbounded (real-time) and bounded (offline) data under the same system.

Unbounded data have a start but no defined end. Unbounded events must be continuously processed, i.e., events must be promptly handled after they have been ingested. Since the data is continuously generated, it is impossible to wait for complete input, something that makes the tradeoff between speed of data computation and accuracy of the computation a necessity. This is executed with the help of ordering and timers that support users when reasoning about result completeness.

Bounded data on the other side, have a defined start and end. Bounded data can be processed by ingesting all inputs before performing any computations. With bounded data, contrary to unbounded data, ordering and timing is not a requirement because a bounded data set can always be sorted. We can think of processing of bounded streams is as batch processing.

stream processing apache flink

Important features of Apache Flink

Apache Flink is a distributed data processing framework, purposely-designed to perform stateful computations over data streams. The Flink runtime is optimized for processing unbounded data streams as well as bounded data sets of any size. Some of the most notable features of Apache Flink include the following:

  • State Management

One of the most important aspects of stream processing is dealing with state. State is a system’s ability to remember past input and use it to influence the processing of future inputs or events. State is essentially any information that an application or stream processing engine will remember across events and streams as more data flows through the system.

  • Fault Tolerance

Apache Flink provides highly-available and fault-tolerant stream processing; Flink supports exactly-once semantics even in the case of failure. When referring to “exactly-once semantics,” you can think of performing stream processing with Apache Flink where each incoming event affects the final results exactly once. Even in the case of a machine or software failure, there’s no duplicate data and no data that goes unprocessed. Flink provides exactly-once semantics with a feature called checkpointing. You can find more information about checkpoints in the Flink documentation to get a thorough overview of the feature.

  • Versioning with Savepoints

Apache Flink’s savepoints allow Flink users to fix issues, reprocess data, update code and upgrade any Flink application easily and with data consistency. A Savepoint is a consistent image of the execution state of a streaming job, created via Flink’s checkpointing mechanism. You can use Savepoints to stop-and-resume, fork, or update your Flink jobs. Savepoints consist of two parts: a directory with (typically large) binary files on stable storage (e.g. HDFS, S3, etc.) and a (relatively small) metadata file. Conceptually, Flink’s Savepoints are different from Checkpoints in a similar way that backups are different from recovery logs in traditional database systems. The primary purpose of Checkpoints is to provide a recovery mechanism in case of unexpected job failures. A Checkpoint’s lifecycle is managed by Flink, i.e. a Checkpoint is created, owned, and released by Flink—without user interaction. On the contrary, Savepoints are created, owned, and deleted by the user. Savepoints are designed for planned, manual backup and resume such as in scenarios of updating a Flink version, changing the job graph or parallelism of an application, forking a second job, etc. 

  • Event-Time Handling

Oftentimes in stream processing when implementing real-time data pipelines you need to factor in additional latencies coming from data traveling through a cellular network or being distant from the data center. Due to such scenarios, data out-of-orderness is regularly present in streaming architectures. Apache Flink embraces the notion of event time, guaranteeing that out of order events are handled correctly and that results are accurate. Event time is the time that each individual event occurred on its producing device. This time is typically embedded within the records before they enter Flink, and that event timestamp can be extracted from each record. In event time, the progress of time depends on the data, not on any wall clocks. Event time programs must specify how to generate Event Time Watermarks, which is the mechanism that signals progress in event time. For more information regarding the different notions of time in Apache Flink, you can visit the Flink documentation

  • Flink SQL

Apache Flink offers a SQL API, that makes it possible for business and non-Java/Scala users to harness the power of stream processing. With the use of Flink’s relational APIs—the Table API and SQL—users can run queries for unified stream and batch processing. While the Table API is a language-integrated query API for Scala and Java, Flink’s SQL support is based on Apache Calcite which implements the SQL standard. Queries specified in either interface have the same semantics and specify the same result regardless of whether the input is a batch or a stream input. With Flink SQL, developers and data scientists can quickly explore data in streams, data at rest (for example, sitting in a database or file storage like HDFS) and build powerful data transformation or analysis pipelines. You can find more information about the Flink SQL supported Syntax, Operations and DDL in the Flink documentation. 

[Related Article: Stream Data Processing with Apache Kafka and TensorFlow]

Flink is able to maintain very large state with exactly-once consistency guarantees, perform local state accesses with low latency, and manage the lifecycle of applications via savepoints. Flink is an excellent choice to power stateful streaming applications. The development of Apache Flink is growing rapidly with more libraries, support and features added with every release. We encourage you to download the latest Flink release and explore the features presented in this article or subscribe to the Flink Mailing list to stay up-to-date with all the latest developments in the framework.

About the author: 

Seth Wiesman is a Senior Solutions Architect at Ververica, consulting clients to maximize the benefits of real-time data processing for their business. He supports customers in the areas of application design, system integration, and performance tuning. Prior to joining Ververica, he was a Data Engineer on the reporting team at MediaMath and has a Masters in Computer Science from the University of Missouri.

ODSC Community

The Open Data Science community is passionate and diverse, and we always welcome contributions from data science professionals! All of the articles under this profile are from our community, with individual authors mentioned in the text itself.