The History of Big Data Processing in 5 Critical Papers

Read the History of Big Data Processing in These 5 Papers

Big data is a multi-faceted area of interest and growth in today’s digital world. While many understand the core concepts of big data, its history is less well known. Individuals interested in big data processing can brush up on its history and core components in the following 5 papers:

  1. The Google File System
  2. MapReduce: Simplified Data Processing on Large Clusters
  3. Spark: Cluster Computing with Working Sets
  4. Kafka: a Distributed Messaging System for Log Processing
  5. Occupy the Cloud: Distributed Computing for the 99%

We break down the review of these papers into two different parts: batch processing and streaming analytics. All papers are easily accessible and provide the foundations for the field of big data.

Big Data wasn’t created with Hadoop, but we will see that Hadoop helped make it accessible to organizations. Several systems already existed to handle massive amounts of data. What Hadoop did was offer affordable, distributed, fault-tolerant data processing on commodity machines: not supercomputers or specialized hardware, but machines you could get from a tech store or online distributor.

Hadoop, inspired by papers from Google engineers, handled massive amounts of data and processed them in a distributed way. It optimized processing time by running computations where the data was stored, minimizing data movement and network traffic, and using the network mainly for coordination. With Hadoop, a massive dataset is sharded and replicated across multiple machines in the cluster; when a processing request comes in, the computation runs on the machines that hold the data in local storage whenever possible. This data locality feature is a significant improvement for clusters on low-speed networks. Hadoop was designed for batch processing: for example, efficiently calculating the PageRank of each website on the web by executing a query on a massive dataset, waiting for the response, and continuing. This offline, or batch, processing was a huge improvement, but it fitted only certain types of jobs.
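The MapReduce model behind Hadoop can be sketched in a few lines of plain Python. This is not Hadoop's Java API, just an illustrative word-count showing the three phases the framework runs across the cluster: map, shuffle, and reduce (all names here are for illustration only).

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: aggregate the counts for one word.
    return (key, sum(values))

documents = ["the quick brown fox", "the lazy dog"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["the"])  # 2
```

In real Hadoop, each map task runs on the machine holding its input split (data locality), and the shuffle moves grouped pairs across the network to the reducers.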

Spark continued improving batch processing performance. It uses in-memory processing with lazy evaluation, improving response times over Hadoop for certain workloads, such as the iterative processing common in data mining tasks, by introducing caching and interactive querying of the data.

After Hadoop and Spark, both efficient tools for batch processing, a new type of data processing emerged: the processing of streaming data. Stream processing gives you access to data as soon as it is produced, and Kafka, through a set of simple rules, allowed companies to better organize their data exchanges, capturing the data being produced in their systems and offering it to the systems that require it.

Batch Processing

Google Papers Influence Creation of Hadoop

In 2003 and 2004, engineers from Google published two papers about how Google maintained its large data processing systems: The Google File System [1] and MapReduce: Simplified Data Processing on Large Clusters [2]. Inspired by these papers, Doug Cutting and Mike Cafarella spun off the storage and processing components of their project, Nutch, to create the initial version of Hadoop as an Apache Foundation project in 2006.

As Hadoop’s different module features grew, commercial use at scale became a reality in 2008. Three companies soon appeared in the big data world based on Hadoop’s programs and procedures: Cloudera Inc. in 2008, MapR in 2009, and Hortonworks in 2011, having raised ~$1Bn, $280M, and $248M respectively in rounds of investment to date. Each company offers its own customized version of Hadoop. To ensure the livelihood and growth of the Hadoop project, Cloudera and Hortonworks contribute a large number of committers (developers, coders, etc.) and project management committee (PMC) members to the initiative.

Cloudera and Hortonworks account for most of the PMC (Project Management Committee) members and committers to Apache Hadoop. The PMC handles high-level decision making and governance of the project; committers are the developers in charge of accepting and making official changes to the project’s source code.

From Hadoop to Spark

By design, Hadoop was meant to act as a system that processed massive amounts of data with ease. There were, however, limitations in such a design. One of the most apparent deficiencies was that Hadoop was ineffective at performing other kinds of tasks. One example was iterative jobs, which are commonly used to train machine learning models. Another was interactive analysis, because after a simple exploratory query you had to wait, sometimes dozens of seconds, for the results to come back.

The Spark framework offers improvements over Hadoop. It builds on HDFS (the Hadoop Distributed File System), which is Hadoop’s implementation of the Google File System (GFS) from the first paper. Spark loads data into memory to process it, with the option of caching datasets, meaning the data is kept in memory for future or recurrent use instead of being read and processed again from disk. Another interesting Spark feature is “lazy computation”: processing starts only when results are needed, and not before. This makes the processing of massive volumes of data more efficient, because Spark can optimize and schedule the operations on the datasets.
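Spark's real RDD API needs a cluster runtime, but the two ideas above, lazy evaluation and caching, can be sketched in plain Python. The class and method names below are illustrative, not Spark's actual API.

```python
class LazyDataset:
    """A toy dataset that records transformations and runs them only on demand."""

    def __init__(self, data):
        self._data = data
        self._ops = []        # queued transformations, not yet executed
        self._cache = None    # materialized result, if .cache() was used

    def map(self, fn):
        # Lazy: just remember the operation; no data is touched yet.
        self._ops.append(fn)
        return self

    def collect(self):
        # An "action": only now is the pipeline actually executed.
        if self._cache is not None:
            return self._cache
        result = list(self._data)
        for fn in self._ops:
            result = [fn(x) for x in result]
        return result

    def cache(self):
        # Materialize once and keep in memory for recurrent use,
        # instead of recomputing from the source on every action.
        self._cache = self.collect()
        return self

ds = LazyDataset(range(5)).map(lambda x: x * 2)  # nothing runs here
print(ds.collect())  # [0, 2, 4, 6, 8] -- the work happens on this call
```

For an iterative machine learning job, `cache()` is what saves repeated passes over the same dataset from re-reading it from disk each time.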

The paper Spark: Cluster Computing with Working Sets [3] explains the capabilities of the system and how it improves upon previous systems. Published in 2010, it presented the Spark framework created by M. Zaharia et al. when he was part of the AMPLab, a research centre at UC Berkeley.

Side Note:

Spark was written in Scala, a programming language for the Java Virtual Machine (JVM) created by Martin Odersky in 2003 (for further understanding, read Scala, the language for Data Science). Scala offers the best features of two different programming approaches: object-oriented programming and functional programming. Scala’s support for distributed computation was a good fit for the aim of making Spark a distributed system. In 2011, Martin Odersky founded Typesafe, known today as Lightbend, which is responsible for the Scala language and Akka, among other reactive tools. The company has raised $52M in investment.

In 2010, Spark was launched as an open source project. Soon after, in 2012, Spark became an Apache project. In 2013, Stoica, Zaharia, and others co-founded Databricks, the company behind Spark, which has driven and increased the adoption of Spark since then. Databricks has raised $247M in investment to date.

Side Note:

The Spark project came from the AMPLab, a research centre at UC Berkeley. The AMPLab created interesting projects such as Mesos and Spark, along with many publications, during the five-year period the project lasted. Although the AMPLab closed in 2016, it made way for a new research centre called the RISELab, a follow-up five-year project focused on Real-time Intelligence with Secure Explainable decisions.

Data Integration is Key – Kafka Comes to Be

The realm of big data exploded when Hadoop entered the market, as it offered affordable batch processing of massive amounts of data in an open source solution. After Hadoop and Spark, we realized that we could process big amounts of data at a faster pace than ever before.

As more and more data sources were added to different big data ecosystems, it became necessary to have a scalable solution in place to handle a variety of data sources, massive volumes of data, and a new pace for information systems. Data integration had become vital.

New data sources include not only “standard” relational databases, but also NoSQL databases, user events, application logs, operational metrics, and others. Due to rising data integration needs across multiple sources, it became not only necessary to execute batch processing seamlessly but also imperative to have real-time or near real-time analysis capabilities. This realization raised a question: since organizations now have far more data available than in the past, how can we handle these new, multiple, and often massive data sources effectively and quickly?

While at LinkedIn, Jay Kreps recognized this problem. Using a set of simple rules, he designed Kafka, a platform for real-time data streams. Kreps designed this platform so that it would be possible to represent pipelines and data integration processes in a central repository. This repository acts as a commit log of changes, persistent and replicated across machines. Essentially, Kafka simplified data infrastructures by offering a scalable solution to handle multiple data sources. The platform makes data available to consumer systems as soon as it is created by producer systems.
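Kafka's real clients are far richer, but its core abstraction, an append-only commit log from which each consumer reads at its own offset, can be sketched in plain Python. The class and names below are illustrative only, not Kafka's API, and this toy models a single partition of a single topic.

```python
class CommitLog:
    """Toy single-partition topic: an append-only log with per-group offsets."""

    def __init__(self):
        self._log = []       # messages persist; consuming does not delete them
        self._offsets = {}   # independent read position per consumer group

    def produce(self, message):
        # Producers only ever append to the end of the log.
        self._log.append(message)

    def consume(self, group):
        # Each consumer group reads from its own offset onward,
        # then advances that offset to the end of the log.
        offset = self._offsets.get(group, 0)
        messages = self._log[offset:]
        self._offsets[group] = len(self._log)
        return messages

topic = CommitLog()
topic.produce("user signed up")
topic.produce("page viewed")
print(topic.consume("analytics"))  # both messages
print(topic.consume("analytics"))  # [] -- this group is caught up
topic.produce("user clicked")
print(topic.consume("billing"))    # all three: a new group starts at offset 0
```

The key design choice this illustrates: because consuming only moves an offset and never removes data, many independent systems can read the same stream at their own pace, which is what lets one central log feed all of a company's consumer systems.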

Then, in 2011, Kafka was presented in Kafka: a Distributed Messaging System for Log Processing [4]. Soon after, Kreps founded Confluent, the company behind Kafka; Confluent has raised $80M since then. Kafka is written in Scala and Java.

Kafka opened the door to thinking about data processing in terms of streams. We can now see really interesting streaming analytics solutions, tools, and approaches such as Apache Beam, Apache Flink, Spark Streaming, Apache Storm, and Heron, among others.

New Level – Optimized Processing

Finally, I’d like to point out a vision paper called Occupy the Cloud: Distributed Computing for the 99% [5]. This paper explains the concept of “serverless architecture” and its benefits, such as elasticity and simplicity. One of the authors of this paper, Ion Stoica, also co-wrote the Spark paper. I highly recommend reading it. Now let’s see if we can more efficiently use the installed resources we have available.


There are countless publications that provide insights into data science and big data. The ones I presented in this article, however, are, in my opinion, some of the “cornerstone” publications that helped shape the world of big data as we see it today.

Hadoop made it possible for companies to process massive volumes of data on commodity machines thanks to HDFS and MapReduce. Spark brought important improvements to the MapReduce model while using the Hadoop Distributed File System. As the number of data sources grew, the issue of data integration became more prevalent; a problem that Kafka took care of for companies broadening their horizons into Big Data.

All of this was made possible for two reasons. One, the papers that contributed to the evolution of big data were made available to the general public. Two, all of the projects were open sourced. Improvements to each project were possible because developers had access to the platforms and their source code, understood how they worked, and improved them or designed something better or different. This approach resonates with me, as I believe a big driver of Big Data is open sourcing the code of the solutions and platforms. Such an approach maximizes collaboration and creativity.

Big Data is an exciting and evolving world. At a quickening pace, we see new players in the technology space disrupting how things work, whether by optimizing workflows, improving upon failures or gaps in previous technologies, or more. The fast pace at which these new technologies come about is possible because of the open source nature of the projects, platforms, and frameworks behind them. It is an exciting time to make contributions and to test and adopt new technologies. Embrace change, and don’t fool yourself into believing you have the ultimate tool, because by the time you get to production there is something newer going on. Keep creating, keep experimenting, and keep breaking down the barriers.



[1] Ghemawat, S., Gobioff, H., & Leung, S. T. (2003). The Google file system. ACM SIGOPS operating systems review, 37(5), 29-43.

[2] Dean, J., & Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113.

[3] Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., & Stoica, I. (2010). Spark: Cluster computing with working sets. HotCloud, 10(10-10), 95.

[4] Kreps, J., Narkhede, N., & Rao, J. (2011, June). Kafka: A distributed messaging system for log processing. In Proceedings of the NetDB (pp. 1-7).

[5] Jonas, E., Pu, Q., Venkataraman, S., Stoica, I., & Recht, B. (2017, September). Occupy the cloud: distributed computing for the 99%. In Proceedings of the 2017 Symposium on Cloud Computing (pp. 445-451). ACM.

Diego Arenas, ODSC

I've worked in BI, DWH, and Data Mining. MSc in Data Science. Experience in multiple BI and Data Science tools, always thinking about how to solve information needs and add value to organisations from the data available. Experience with Business Objects, Pentaho, Informatica PowerCenter, SSAS, SSIS, SSRS, MS SQL Server from 2000 to 2017, and other DBMS, Tableau, Hadoop, Python, R, SQL. Predictive modelling. My interests are in Information Systems, Data Modeling, Predictive and Descriptive Analysis, Machine Learning, Data Visualization, and Open Data. Specialties: data modeling, data warehousing, data mining, performance management, business intelligence.