Versatile Spark – Streaming

Distributed computing is the fuel for large-scale processing in modern data pipelines. Hadoop and its open-source competitors provide the tools that hold these systems together. In recent years, rival Apache Spark has gained favor due to its versatility. As preference for Spark grows, the software diversifies and its applications increase.

Apache Spark is a product of the famous AMPLab at the University of California, Berkeley. The project is written mainly in Scala, but also has APIs for R, Python, and Java. Apache Spark’s core functionality is parallelization and distributed processing. Spark’s other components leverage the power of this core to fulfill different functions. Spark SQL allows users to query data held in Spark with SQL, much as they would query a relational database. MLlib is a growing library for distributed machine learning, and GraphX is Spark’s library for graph processing. Finally, Spark Streaming is an engine for real-time streaming and processing of data.
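The idea behind that core, splitting a dataset into partitions, working on each partition independently, and combining the partial results, can be sketched in plain Python. This is only an illustration of the pattern and not Spark's API: the function names are hypothetical, and a local thread pool stands in for the workers of a real cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_partitions(data, num_partitions):
    """Split a list into roughly equal contiguous chunks (hypothetical helper)."""
    size, remainder = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `remainder` partitions take one extra element.
        end = start + size + (1 if i < remainder else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def sum_of_squares(partition):
    """Work done independently on a single partition."""
    return sum(x * x for x in partition)


def parallel_sum_of_squares(data, num_partitions=4):
    partitions = split_into_partitions(data, num_partitions)
    # Threads stand in for cluster workers in this local sketch.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_results = list(pool.map(sum_of_squares, partitions))
    # Combine the per-partition results into one answer.
    return reduce(lambda a, b: a + b, partial_results, 0)
```

Calling `parallel_sum_of_squares(list(range(10)))` computes the same total as a single-threaded loop, but each partition's work could in principle run on a different machine.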

Spark Streaming’s importance is increasing as companies rush to get real-time insight from data. Interoperability is a major advantage Spark has over other platforms for real-time streaming. The Discretized Stream (DStream) is Spark Streaming’s data structure. It builds on Spark’s main data structure, the Resilient Distributed Dataset (RDD): a DStream is represented as a sequence of RDDs, one for each batch interval. Because Spark’s other components also work with RDDs, data can be funneled from Spark Streaming straight into another Spark component. Streaming live Twitter data into an MLlib model to do real-time sentiment analysis is one use case.
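To make the "discretized" part concrete, here is a minimal plain-Python sketch of micro-batching: timestamped messages are grouped into fixed-width batches, and each batch is then scored as a small, ordinary dataset. It does not use Spark. The function names, the two-second interval, and the toy word lists are all assumptions for illustration; a real pipeline would hand each batch to a trained model instead.

```python
from collections import defaultdict

# Toy word lists: assumptions for illustration, not a real sentiment model.
POSITIVE = {"great", "love", "fast"}
NEGATIVE = {"slow", "hate", "broken"}


def discretize(events, batch_seconds):
    """Group (timestamp, text) events into fixed-width time batches."""
    batches = defaultdict(list)
    for timestamp, text in events:
        batches[int(timestamp // batch_seconds)].append(text)
    # Return non-empty batches in time order. (This sketch skips
    # intervals that received no events.)
    return [batches[key] for key in sorted(batches)]


def score(text):
    """Count positive words minus negative words in one message."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def score_stream(events, batch_seconds=2):
    """Produce one aggregate sentiment score per batch."""
    return [
        sum(score(text) for text in batch)
        for batch in discretize(events, batch_seconds)
    ]
```

Each batch is handled with the same code one would write for a static collection, which is the property that lets streaming data feed into batch-oriented components.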

The recent release of Spark 2.0 features increased speed, a more streamlined API, and the introduction of structured streaming. The future continues to look bright.

 
©ODSC 2016, Feel free to share + backlink!

Gordon Fleetwood

Gordon studied Math before immersing himself in Data Science. Originally a die-hard Python user, R's tidyverse ecosystem gradually subsumed his workflow until only scikit-learn remained untouched. He is fascinated by the elegance of robust data-driven decision making in all areas of life, and is currently involved in applying these techniques to the EdTech space.
