Distributed computing fuels large-scale processing in modern data pipelines, and Hadoop and other open-source frameworks put it within reach. In recent years, Apache Spark has gained favor for its versatility. As adoption grows, the software diversifies and its applications multiply.
Apache Spark is a product of UC Berkeley's AMPLab. The project is written in Scala but also offers APIs for R, Python, and Java. Spark's core functionality is parallelization and distributed processing, and its other components leverage this core to fulfill different roles. Spark SQL lets users query Spark objects just as they would a SQL database. MLlib is a growing engine for distributed machine learning, and GraphX is Spark's library for graph processing. Finally, Spark Streaming is an engine for ingesting and processing data in real time.
Spark Streaming's importance is increasing as companies rush to extract real-time insight from their data, and interoperability is a major advantage Spark holds over other real-time streaming platforms. Spark Streaming's data structure, the Discretized Stream (DStream), is an extension of Spark's main data structure, the Resilient Distributed Dataset (RDD). This makes it easy to build a pipeline that funnels data from Spark Streaming into another Spark component. One use case is streaming live Twitter data into an MLlib model for real-time sentiment analysis.
The recent release of Spark 2.0 brings increased speed, a more streamlined API, and the introduction of Structured Streaming. The future continues to look bright.
©ODSC 2016, Feel free to share + backlink!