The 5 Biggest Debates in Data Science Today

The meteoric rise of data science in recent years is not without controversy. There are a number of ongoing debates in the discipline that have gone unresolved for quite some time. The short list below contains the debates I most often see discussed online, at conferences, and even in my own small personal circle of practicing data scientists. The list appears in no particular order of importance. I think they’re all pretty important and compelling. Let’s dive in!

Debate #1 – Data Science and Privacy

This is the debate that gets the most mainstream press these days. I see articles about the wide-ranging privacy implications of big data and AI in all sorts of publications, including newspapers, magazines, and blogs, not to mention social media. The most prevalent discussions focus on the misuse of data by the tech industry’s largest players: Facebook, Google, Amazon, and others. We’re still seeing the impact of the Facebook/Cambridge Analytica scandal surrounding the U.S. presidential election in 2016. New Facebook scandals keep popping up as it becomes clear how much personal data the company has made available to other tech companies without explicit permission from its users.

These issues around data science and privacy will only become more prominent in the years ahead. Citing security concerns, government agencies are installing cameras in more locations. Advances in facial recognition technology, driven largely by deep learning, will make video data more searchable. This raises serious privacy concerns, especially in light of the dystopian surveillance goals of countries like China, which has an estimated 200 million surveillance cameras directed at its citizens. There’s even an annual holiday every January 28 to address this debate – Data Privacy Day. This debate surrounding data and how data scientists are using it is just getting started.

Debate #2 – R versus Python

This is the debate that never ends. With almost religious fervor, those who code in either language defend their tool of choice with intense passion. I’ve seen months-long or even years-long discussions on LinkedIn, Stack Overflow, and other technology sites. The debate quiets down for a while, but then someone fans the flames and it starts up again. R and Python remain the two most popular languages used by data scientists.

Coming from academia, my early data science language was R because that’s what my professors used, and I stuck with it for years. But given the rise in popularity of Python, especially in the deep learning community, I gave in, and now I regularly use both languages in my work as a data scientist. This is what I recommend to all data scientists today – regardless of which language you started with, just pick up the other and use both R and Python. Scala or Julia, anyone?

Debate #3 – Is data science just a rebranding of statistics?

This debate is somewhat personal to me since my academic background is in computer science and applied statistics. I worked as a data scientist during an era when the term “data science” did not exist. To me, data science has existed for as long as computers have been around, and AI as a field of thought dates back to the 1950s. So in a sense “data science” is indeed a rebranding, but as a data scientist, I like it a lot. Previously, when I found myself at a dinner party and someone asked what I did for a living, I had a difficult time explaining my work – “Well, I use computer science, applied mathematics, statistics, probability theory, etc.” It was at this point that the person slithered away with an awkward look on their face. Now, I can declare “I am a data scientist,” and many people have a rough idea of what that is.

A few years ago, my alma mater UCLA was planning a Master’s Degree program for students wanting to go into data science. After much discussion, the administration decided to use the name “Master of Applied Statistics” instead of “Master of Data Science.” I think that was a good decision in order to give the program longevity. The term “data science” could be a fad, but the underlying disciplines won’t change. This was a conscious decision not to rebrand statistics.

Debate #4 – Who can deliver the best results – data scientists or domain experts?

This often passionate debate is not about whether data scientists can deliver effective business solutions, but rather whether domain experts play a significant role in the delivery of such solutions. To me, this debate is kind of nonsensical because these two designations are symbiotic. Data scientists absolutely need domain experts, unless, of course, the data scientist has a specific background in a particular domain. This is frequently the case, since we’re seeing a lot of data scientists transitioning into the field from other disciplines. I know several psychology PhDs who are also accomplished data scientists, so if psychology is the problem domain, then you’ve got an all-in-one solution.

On the flip side, it is also the case that domain experts need data scientists. I mean, there’s only so much you can do with Excel!

The debate may have started with the recognition that Kaggle has repeatedly shown that accurate machine learning solutions can be built and tested for performance without the participation of domain experts. Most Kagglers don’t have domain experience for the challenges in which they compete. Further, most Kaggle competitions are won through creative feature engineering that may or may not involve domain experts. Of course, the main counter-argument to these success stories is that in many of these competitions, the domain experts had provided the initial business hypothesis by asking the appropriate questions and preparing the data.

In my own data science projects, I work with domain experts all the time. I wouldn’t even think to approach a project without access to people who are experts in the business behind the proposed problem.

Debate #5 – Is data science dead?

I really don’t get this one, and I see this question being debated more and more lately. It’s like saying that computer science and applied statistics are dead. No, they’re not. But I think the genesis of this debate is the fear that the data science profession is being overly commoditized and that our value isn’t being properly communicated to the enterprise thought leaders who hire us.

Here is a good example of one discussion happening on LinkedIn: “Data Science Dead in 5 Years or Less.” The author presents 5 observations for why he believes data science is going the way of the dodo. I actually agree with most of the points presented; I just don’t agree with the conclusion. Instead of dying, I think data science is simply evolving past all the original hype. That’s probably a good thing. I don’t think the essence of data science is going anywhere, and I expect it will be with us for a long time.

Ready to get exposure to all of these debates and more? Attend ODSC East 2020 this April 13-17 in Boston!

Daniel Gutierrez, ODSC

Daniel D. Gutierrez is a practicing data scientist who has been working with data since long before the field came into vogue. As a technology journalist, he enjoys keeping a finger on the pulse of this fast-paced industry. Daniel is also an educator, having taught data science, machine learning, and R classes at the university level. He has authored four computer industry books on database and data science technology, including his most recent title, “Machine Learning and Data Science: An Introduction to Statistical Learning Methods with R.” Daniel holds a BS in Mathematics and Computer Science from UCLA.