Ever since the term “data scientist” came onto the tech scene, a cross-generational debate has raged over how to define and distinguish newly branded data scientists and traditional statisticians. I personally adopted the data scientist title around 2012, and I recall a rather pithy definition floating across the Twittersphere around this time:
In a more serious light, data science is often defined as the confluence of three areas: computer science, mathematics/statistics, and specific domain knowledge. Implicit in this definition is the focus on solving specific problems, in contrast with the type of deep understanding that is typical in academic statistics.
In this article, we’ll take yet another look at the data scientist/statistician kerfuffle to see if we can find some common ground and maybe even a common endpoint.
Data Science or Statistics?
It seems that the designation “data scientist” has taken the world by storm. It’s a title that conjures up almost mystical abilities of a person garnering information from deep data lakes with ease. It comes from a belief that a data scientist can wave his or her hand like a 21st century Houdini and effortlessly extract insights from the data.
What’s intriguing about the field of data science is its perceived threat to other disciplines, specifically statistics. I don’t see this threat as real, however, because the two fields are quite distinct and complementary. Over the past decade, it has become clear that although the two fields can exist on their own, each is weaker without the other. Statisticians need to understand the modeling and structure of data, while data scientists need to understand applied statistics.
It’s no wonder that statisticians feel threatened by data scientists to a certain degree. Statisticians deal with nebulous concepts like point estimates, margins of error, confidence intervals, standard errors, p-values, hypothesis testing, and the proverbial argument between the “frequentists” and “Bayesians.” Statisticians can come across as confusing to the general public, and often they can’t even agree among themselves on what is correct.
Data scientists, on the other hand, closely follow the more approachable “data science process”: data ingestion, data transformation, exploratory data analysis, model selection, model evaluation, and data storytelling. Sure, many of these steps rely on statistical methods behind the scenes, but they’re sealed in a more engaging and understandable wrapper. Many more people can embrace data science.
To be sure, there will always be a need for a solid foundation in statistics. There are many cases where a data scientist would not have a clue what to do with certain data sets without help from someone with a background in statistics. At the same time, if a statistician were handed a high-dimensional data set with 5 billion rows and 10,000 variables, they’d be hard-pressed to set up the data for analysis without consulting a data scientist.
Ultimately, the two disciplines need to find some common ground. Statistics programs should teach students how to work with real-world data as part of the curriculum. And those working in data science need to have the appropriate training in statistics.
Further Comparing and Contrasting
Although data scientists and statisticians tend to gather information for similar purposes, their means of data collection are quite different. On one hand, the amount of data for data scientists is often massive; consequently, they spend a lot of time on tasks like large-scale data ingestion, data cleansing, and transformation. Conversely, statisticians still rely on more traditional and smaller-scale methods of data collection, such as surveys, polls, and experiments.
Typically, data science problems are formulated using a modeling process that focuses on the predictive accuracy of the model. Data scientists compare the predictive accuracy of different machine learning algorithms and select the model with the best accuracy. Statisticians take a different approach to building and testing their models. The starting point in statistics is usually a simple model, such as linear regression, and the data are checked for consistency with the assumptions of that model. The model is improved by addressing any assumptions that are violated. The modeling process is considered complete when all model assumptions have been checked and none is violated.
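To make the data scientist's side of this concrete, here is a minimal sketch of selecting a model by held-out predictive accuracy. Everything in it is an assumption for illustration: the synthetic data, the three toy candidate models, the 150/50 train/test split, and the helper names (`fit_mean`, `fit_line`, `fit_nearest_neighbor`, `select_model`). Real projects would normally lean on a machine learning library and cross-validation rather than a single split.

```python
import random

def fit_mean(xs, ys):
    """Baseline model: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Ordinary least squares with a single predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def fit_nearest_neighbor(xs, ys):
    """1-nearest-neighbor: predict the outcome of the closest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def select_model(candidates, train, test):
    """Fit each candidate on train, score it on test, return (scores, best name)."""
    scores = {}
    for name, fit in candidates.items():
        model = fit(*train)
        scores[name] = mean_squared_error(model, *test)
    return scores, min(scores, key=scores.get)

# Synthetic data (assumed for illustration): y = 2x + 1 plus Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]
train = (xs[:150], ys[:150])
test = (xs[150:], ys[150:])

candidates = {
    "mean": fit_mean,
    "line": fit_line,
    "nearest_neighbor": fit_nearest_neighbor,
}
scores, best = select_model(candidates, train, test)
for name, mse in scores.items():
    print(f"{name}: held-out MSE = {mse:.3f}")
print("selected:", best)
```

Note that the selection rule never asks whether any model's assumptions hold; it only asks which candidate predicts unseen data best.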
While data scientists focus on comparing a number of different methods to create the best machine learning model, statisticians instead work to improve a single, simple model to best fit the data.
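The statistician's loop of "fit a simple model, check its assumptions, fix what is violated" might look roughly like the sketch below. It rests on loud assumptions: the data are synthetic with a deliberately curved relationship, and `curvature_signal` is a crude, hypothetical diagnostic of my own (the correlation between residuals and the squared, centered predictor), standing in for the residual plots and formal tests a statistician would actually use.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares with a single predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def correlation(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5

def curvature_signal(xs, ys):
    """Fit y ~ x, then correlate the residuals with the squared centered predictor.

    A value near 0 is consistent with the linearity assumption;
    a value far from 0 suggests leftover curvature in the residuals.
    """
    intercept, slope = fit_line(xs, ys)
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    mean_x = sum(xs) / len(xs)
    return correlation([(x - mean_x) ** 2 for x in xs], resid)

# Synthetic data (assumed for illustration): the true relationship is quadratic.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [3 + 0.5 * x ** 2 + rng.gauss(0, 1) for x in xs]

# Step 1: start with the simple model y ~ x and check the linearity assumption.
before = curvature_signal(xs, ys)

# Step 2: address the violation by transforming the predictor, then re-check.
xs_squared = [x ** 2 for x in xs]
after = curvature_signal(xs_squared, ys)

print(f"curvature signal, y ~ x:   {before:.2f}")
print(f"curvature signal, y ~ x^2: {after:.2f}")
```

The point of the design is that the model stays simple throughout; what changes is how well the data satisfy its assumptions.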
Statisticians tend to focus more on quantifying uncertainty than data scientists. As part of the statistical model-building process, it’s common to quantify the connection between the outcome being predicted and each predictor. Any uncertainty about this connection is also quantified. This process is not as common with the tools used by data scientists, namely machine learning.
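As a rough illustration of what quantifying that connection and its uncertainty means, the sketch below estimates a regression slope together with its standard error and an approximate 95% confidence interval. The data are synthetic, the function name `slope_with_uncertainty` is hypothetical, and the interval uses a normal approximation rather than the t distribution a statistician would normally use, which is reasonable only for fairly large samples.

```python
import random
from statistics import NormalDist

def slope_with_uncertainty(xs, ys, confidence=0.95):
    """Return (slope, standard error, (low, high)) for the model y ~ x.

    Assumption: the interval uses a normal approximation instead of
    the t distribution, so it is slightly too narrow for small samples.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual variance with n - 2 degrees of freedom (two fitted parameters).
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    residual_variance = rss / (n - 2)
    standard_error = (residual_variance / sxx) ** 0.5

    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return slope, standard_error, (slope - z * standard_error, slope + z * standard_error)

# Synthetic data (assumed for illustration): true slope of 2 with Gaussian noise.
rng = random.Random(2)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]

slope, standard_error, (low, high) = slope_with_uncertainty(xs, ys)
print(f"slope = {slope:.3f}, standard error = {standard_error:.3f}")
print(f"approximate 95% confidence interval: ({low:.3f}, {high:.3f})")
```

A typical machine learning workflow would report only the predictions and an accuracy score; the standard error and interval are the extra step described above.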
The two fields also use somewhat different nomenclature to describe the same concepts. Data scientists say “example” where statisticians say “observation,” “feature” where statisticians say “predictor” or “independent variable,” and “label” where statisticians say “response” or “dependent variable.”
As things stand, the fields of data science and statistics differ in a number of ways: in modeling processes, the size of data consumed, the types of problems studied, the academic background of the people in the field, and the terminology used. At the same time, the fields are closely related in the sense that both data science and statistics aim to extract knowledge from data.
Given time, the fields of data science and statistics will likely converge to a common endpoint. Statisticians have been gathering data and applying analysis techniques like linear regression for centuries. As more statisticians pick up skills like implementing algorithms that learn from data and provide predictions and actions, and as more data scientists pick up statistical science (sampling, experimental design, confidence intervals, p-values, etc.), the boundary between data scientists and statisticians will blur.