Are these terms familiar?
Big Data, Machine Learning, Artificial Intelligence, Deep Learning, Hadoop, MapReduce, Spark, Streaming Analytics, Data Science.
In this post we will expand the knowledge of executives, and of those aspiring to be one, by providing decade-by-decade summaries of the Data Science industry and by comparing Data Science definitions. Spoiler alert: this is not written for kindergarteners, it is not encoded in IT jargon, and there will be acronyms.
In the beginning there was Data. Then data became information, information became knowledge, and knowledge became repeatable actions. This, in general, is the process of using data to improve and optimise businesses.
In the last steps, when we talk about Data Science, it all comes down to Information Systems: the analysis pipelines, machines, hardware, software, and storage which, as a whole, make up the Data Science environment.
There are two key factors in successful Data Science projects. First, you start with a question, a problem, or an opportunity, and then you let Data Scientists go to work; simply put, the scientists will suggest solutions after some data analysis. Second, the scientists must ensure that executives see value in the project, so that those with domain knowledge can get on board.
By the end of this post you will have a much clearer understanding of several data-related terms and how they relate to each other. You will be able to:
- Describe the history of information technologies.
- Explain the terms Data Science, Data Mining, and Machine Learning.
- Become familiar with big data approaches for batch processing and streaming analytics.
Data Science timeline
In the 50’s, the term Artificial Intelligence was coined. People working in different fields wanted to simulate human intelligence, which led to different paths of development, from probabilistic approaches to rule-based systems.
In the 60’s the concept of Relational Databases appeared. In his famous 1970 paper, Codd established the foundational rules of modern relational databases. Before relational databases, the repositories were called “data banks”. Codd argued that the relational model outperformed the graph model in use at the time; nowadays graph databases are again part of the ecosystem of data technologies.
In the 70’s the field of Neural Networks (NNs) started. A neural network is a structure that tries to classify or predict an output based on the input data. This early work in neural networks evolved into what we now know as Deep Learning (DL). A DL network can have several layers of hidden units, or neurons; the word “deep” comes from the depth of the hidden layers in the network.
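To make the idea of hidden layers concrete, here is a minimal sketch of a forward pass through a network with one hidden layer, in plain Python. The weights are made-up numbers purely for illustration; a real network would learn them from data.

```python
import math

def sigmoid(x):
    # Squash any real number into the (0, 1) range.
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, hidden_weights, output_weights):
    # Hidden layer: each hidden neuron is a weighted sum of the
    # inputs passed through an activation function.
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs)))
              for ws in hidden_weights]
    # Output neuron: combines the hidden activations the same way.
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))

# Hypothetical weights for a 2-input, 2-hidden-neuron, 1-output network.
prediction = forward([1.0, 0.5],
                     hidden_weights=[[0.4, -0.6], [0.3, 0.8]],
                     output_weights=[1.2, -0.7])
print(prediction)  # a score between 0 and 1
```

Stacking more hidden layers between the input and the output is what turns this structure into a “deep” network.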
In the 80’s a “lot of data” meant something different than it does today, and will probably mean something totally different in 20 years. Due to advances in technology, more data processing techniques developed. As a result of the growing complexity, a new approach called Business Intelligence (BI) emerged. It tries to separate the operational world from the informational world. Dedicated data repositories, tuned to respond quickly to user queries and return reports, were set up at every company eager to satisfy its information needs.
In the 90’s the wide use of systems like ERPs (Enterprise Resource Planning systems), CRMs (Customer Relationship Management systems), and SCMs (Supply Chain Management systems) started to generate challenging amounts of data. The solution was the Data Warehouse: an enterprise-wide, isolated, historic, non-volatile information environment for producing reports and running queries.
In the 00’s, collecting huge amounts of data started to be normal, which became a problem: companies needed to figure out how to process all the data they had. The NoSQL movement also gained popularity. Big Data erupted and made it possible to process previously unthinkable amounts of data thanks to Massively Parallel Processing (MPP) databases, in-memory processing, distributed computing, and so on. Up to this decade, information systems were designed to support the process of making business decisions, providing facts to help executives achieve the goals and objectives in their business strategies. Big Data brought a fresh perspective: the idea of exploration and insight from vast amounts of data freed people to discover how data can help their businesses. Big Data inverted the data problem, and Data Scientists became crucial. Data Scientists, the workers of the Data Science world, are the ones surfing Big Data, trying out models and techniques to improve the business.
In the 10’s, AI and the rise of Deep Learning and other machine learning techniques took over. Big Data interest shifted from batch data processing to stream data processing. Neural networks are now implemented in frameworks, freeing developers from building layers and cost functions from scratch.
Algorithms, Models, Frameworks and Techniques
It is important to be as precise as possible about the language we use. An algorithm is a set of steps to produce a result: a generic approach to solving a class of problems. A model is what you get when you train an algorithm on data. A decision tree is an algorithm; if you train decision trees with two different sets of variables from the same data source, you will get two different models.
Examples of algorithms are decision trees, Naive Bayes, support vector machines (SVMs), neural networks, etc. Data scientists train several models and select the best one according to evaluation results, or sometimes, based on the explainability of the model, they choose the models that are easier to interpret and explain. A good principle to follow when selecting final models is Occam’s razor: prefer the method with the fewest assumptions, or in other words the simplest one.
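The algorithm-versus-model distinction can be sketched in a few lines of plain Python. Here a decision stump (the simplest possible decision tree, a single split on a single feature) is the algorithm; training it twice on the same made-up customer data, but with different feature choices, yields two different models. The data and feature names are hypothetical.

```python
def train_stump(rows, labels, feature):
    # Decision stump: find the threshold on one feature that best
    # separates the labels in the training data.
    best_threshold, best_accuracy = None, -1.0
    for row in rows:
        threshold = row[feature]
        preds = [1 if r[feature] >= threshold else 0 for r in rows]
        accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = threshold, accuracy
    return {"feature": feature, "threshold": best_threshold}

# Hypothetical customer data: [age, income], label = bought the product.
rows = [[25, 30], [35, 80], [45, 60], [52, 95]]
labels = [0, 1, 0, 1]

# Same algorithm, same data source, two feature choices -> two models.
model_a = train_stump(rows, labels, feature=0)  # split on age
model_b = train_stump(rows, labels, feature=1)  # split on income
print(model_a, model_b)
```

The algorithm (the training procedure) never changed; only the trained artifacts, the models, differ.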
Data Science, Data Mining, and Machine Learning
When we try to predict a number, such as tomorrow’s temperature or the amount of income the company will receive next year, we are talking about a Regression problem. On the other hand, if we try to predict a category, for example whether or not it is going to rain tomorrow, or whether a customer will or will not buy a product, we are talking about a Classification problem. Both regression and classification problems are Predictive Problems.
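The contrast can be shown on the same toy input. Below, a least-squares line predicts a number (regression), while a simple threshold rule predicts a category (classification). The observations and the humidity threshold are invented for illustration only.

```python
# Hypothetical daily observations: (humidity %, next day's temperature).
history = [(40, 21.0), (55, 19.5), (70, 18.0), (85, 16.5)]
humidity_today = 60

# Regression: predict a number with a least-squares straight line.
n = len(history)
mean_x = sum(h for h, _ in history) / n
mean_y = sum(t for _, t in history) / n
slope = (sum((h - mean_x) * (t - mean_y) for h, t in history)
         / sum((h - mean_x) ** 2 for h, _ in history))
intercept = mean_y - slope * mean_x
predicted_temp = slope * humidity_today + intercept  # a number

# Classification: predict a category from the same input, here with
# a hand-picked humidity threshold standing in for a trained model.
predicted_label = "rain" if humidity_today >= 75 else "no rain"

print(predicted_temp, predicted_label)
```

Same input, two framings: the regression answers “how much?”, the classification answers “which one?”.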
When we explore the dataset instead of predicting a result, we are talking about Descriptive problems. Clustering, correlations, and association rules are among these problems. Clustering tries to group records, for example groups of customers, maximizing similarity inside each group and minimizing commonality between the different groups, or clusters. We use correlations when we try to explain how two or more features of our dataset are linearly related. Association rules are powerful: they are used when we want to make recommendations based on historic data; when you notice that “people who viewed this item also viewed that item”, you are getting some flavour of association rules.
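The association-rule idea can be sketched with co-occurrence counts over some invented viewing histories. The item names and baskets below are hypothetical; the confidence measure (how often the consequent appears given the antecedent) is the standard one for association rules.

```python
from collections import Counter
from itertools import combinations

# Hypothetical viewing histories, one set of items per customer.
baskets = [
    {"camera", "tripod", "bag"},
    {"camera", "tripod"},
    {"camera", "bag"},
    {"tripod", "lens"},
]

# Count how often each pair of items is viewed together.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Confidence of the rule "viewed camera -> also viewed tripod":
# pair co-occurrences divided by baskets containing "camera".
camera_baskets = sum("camera" in b for b in baskets)
confidence = pair_counts[("camera", "tripod")] / camera_baskets
print(confidence)
```

A recommender would surface rules whose confidence (and support) exceed some chosen cutoff.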
Predictive and Descriptive problems have been known as Data Mining problems in Industry, where until recent years most of the discovery used structured data sources, databases, and data warehouses; in Academia they are usually referred to as Machine Learning (predictive) and Pattern Recognition (descriptive) problems, which also include semi-structured and unstructured data. One of the main distinctions is the quality and provenance of the data used to model; in the end, we are talking about the same modeling techniques and algorithms.
Learning can be defined as the ability to adjust behaviour towards a determined outcome; machine learning is the ability of machines to adapt internal knowledge to optimize a specific output. Instead of creating rules to solve a problem, we show the input data to the machine and the machine learns the pattern that optimizes the output. This kind of problem is known as supervised learning, because we compare the result of the model against the known result and tweak the model to improve it.
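Supervised learning in miniature looks like the loop below: compare the model’s guess against the known answer and nudge the model to reduce the error. This is a one-weight gradient-descent sketch on made-up data whose underlying pattern is y = 2x; the data and learning rate are assumptions for illustration.

```python
# Inputs paired with their known outputs (the "supervision").
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
weight = 0.0
learning_rate = 0.05

for _ in range(200):
    for x, y_true in data:
        y_pred = weight * x                   # the model's current guess
        error = y_pred - y_true               # compare vs the known result
        weight -= learning_rate * error * x   # tweak to reduce the error

print(round(weight, 3))  # converges near 2.0, the underlying pattern
```

No rule “multiply by two” was ever written down; the machine recovered it from examples.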
Another distinction we find in industry is the term Business Analytics (BA). We can understand BA as having a broader scope than Business Intelligence and Data Warehousing: BA covers Predictive, Descriptive, and Prescriptive problems. Prescriptive problems are those that recommend the best action based on the results of predictive models.
Analytics is an interchangeable term for Business Analytics. Now we can define Big Data Analytics: Business Analytics when we add huge amounts of data, from multiple sources, in non-structured formats, to our information needs.
In terms of programming languages for describing and analysing datasets, the most common are Python and R (see this Data Science Jobs Skills Analysis). But when datasets don’t fit in memory we need better approaches, and the Scala programming language becomes a good option for big data and data science problems.
Big Data Ecosystem
Companies are storing more data than ever because disk storage is cheaper than before. Meanwhile, the fact that processors are not getting any faster creates an opportunity for Big Data: the alternative is to have several cores and processors working at the same time, using a scalable approach that partitions big data sets into smaller chunks and processes the pieces on commodity machines. This ability to distribute the computations and summarize the results at the end is what makes Big Data technology capable of processing huge data sets.
There are two main approaches to processing big data: batch processing and stream processing. Batch processing works on huge data files of fixed size, like weblogs or transactional data tables. Stream processing is the right approach when you have continuous flows of incoming data and need to use the data as soon as it is produced.
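The two approaches can be contrasted on a toy average. The batch version waits for the whole fixed-size dataset; the stream version keeps running state and emits an up-to-date answer after every record. The readings are invented for illustration.

```python
# Batch: the whole fixed-size dataset is available up front.
def batch_average(values):
    return sum(values) / len(values)

# Stream: records arrive one at a time; keep running state instead
# of waiting for all the data to land.
def stream_averages(stream):
    total, count = 0.0, 0
    for value in stream:
        total += value
        count += 1
        yield total / count  # an up-to-date answer after every record

readings = [10, 20, 30, 40]
print(batch_average(readings))          # one answer at the end: 25.0
print(list(stream_averages(readings)))  # [10.0, 15.0, 20.0, 25.0]
```

Real platforms add partitioning, fault tolerance, and windowing on top, but the state-per-record shape is the essence of stream processing.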
So, how much data is Big Data? A reasonable rule of thumb: if your data fits in memory, you don’t yet have a Big Data problem at hand, which means you don’t yet need a cluster of machines to process it.
What’s Big Data? We can think of two senses in which people talk about Big Data: when they refer to the 3 V’s or 5 V’s of data (Volume, Velocity, Variety, Veracity, and Value), and when they mention “Hadoop/MapReduce” solutions in conversation.
Some Big Data platforms
Apache Hadoop is a popular big data platform. It implements the MapReduce framework, which lets you tackle a problem with a two-phase solution: a first phase of transformation over the data that can be performed in parallel (the map phase), and a second phase that integrates the partial results (the reduce phase).
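The two phases can be sketched in plain Python with the classic word-count example. This is only a single-process illustration of the MapReduce shape, not Hadoop’s API: in a real cluster the map calls run in parallel on different machines and the framework shuffles the key/value pairs to the reducers.

```python
from collections import defaultdict

# Map phase: runs independently on each chunk of input,
# emitting (key, value) pairs.
def map_phase(line):
    return [(word, 1) for word in line.split()]

# Reduce phase: integrates the partial results per key.
def reduce_phase(pairs):
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data", "big models"]
mapped = [pair for line in lines for pair in map_phase(line)]
print(reduce_phase(mapped))  # {'big': 2, 'data': 1, 'models': 1}
```

Because each map call touches only its own chunk, the map phase scales out across commodity machines; only the reduce phase needs to see all values for a given key.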
Apache Spark is another big data platform; it optimizes processing times and is very useful for repeated operations over the same dataset. Apache Spark also has useful libraries: Spark Streaming for stream processing, Spark MLlib, which provides machine learning algorithms to train in Spark, and Spark SQL, to interact with Spark through a SQL-like syntax.
Apache Kafka and Apache Flink are platforms for streaming analytics: they take a continuous flow of incoming data, process it, and deliver or serve the results as outputs of the system. Anomaly detectors and real-time telemetry data processing are typical examples.
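As a flavour of the anomaly-detector use case, here is a minimal single-process sketch: flag a telemetry reading as anomalous when it sits several standard deviations from the running mean, updating the statistics online (Welford’s algorithm) as each reading arrives. The telemetry values and the 3-sigma threshold are assumptions; a streaming platform would run this logic continuously over partitioned, fault-tolerant streams.

```python
import math

def detect_anomalies(stream, threshold=3.0):
    # Welford's online algorithm keeps a running mean and variance
    # without storing the stream.
    count, mean, m2 = 0, 0.0, 0.0
    anomalies = []
    for value in stream:
        if count >= 2:
            std = math.sqrt(m2 / (count - 1))
            if std > 0 and abs(value - mean) > threshold * std:
                anomalies.append(value)
        # Fold the new reading into the running statistics.
        count += 1
        delta = value - mean
        mean += delta / count
        m2 += delta * (value - mean)
    return anomalies

telemetry = [10.0, 10.2, 9.9, 10.1, 42.0, 10.0]
print(detect_anomalies(telemetry))  # [42.0]
```

The key property is constant memory per stream: the detector never needs to look back at old readings, which is what makes it viable on an unbounded flow.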
Many Big Data platforms started as internal solutions at tech companies and were later released as Open Source projects. Such is the case with Apache Kafka at LinkedIn, Apache Hadoop (based on papers published by Google engineers and implemented by Yahoo! engineers), Apache Storm and Apache Heron (an incubating project) at Twitter, and Apache Hive at Facebook, to name a few.