My son, Jack, loves Jif peanut butter. Correction: Jack loves creamy Jif peanut butter in a large plastic tub. I once went to the grocery store and bought 5 different kinds of peanut butter for a blind taste test. For some reason, I wanted to prove to him that other brands of peanut butter were just as good. The results of the blind tasting were unanimous. Everyone in the family – all 4 kids, my wife, and I – liked the Publix store brand peanut butter better than all the others. So, naturally, you can guess what type of peanut butter is in my pantry: Jif. Creamy. Large plastic tub.
I suppose a lot of things in life are like that. Either you’ve been doing them the same way for so long that you don’t notice the bad parts anymore, or you’ve somehow managed to convince yourself that those bad parts aren’t really that bad. Do you even deserve to be happy in the first place?
I started out my career in data science at the University of Alabama, which happens to be one of the biggest academic supporters of SAS, so I learned Base SAS. From PROC SQL to DATA steps, I know Base SAS backward and forward. I tried to switch to Enterprise Guide or Enterprise Miner a few times (if you’re ever feeling too happy, try to train a random forest in eMiner), but the UI is hilariously bad, and I could never make it do exactly what I wanted, so back to the code editor.
I also spent a fair bit of time in RStudio, which is a definite step up from SAS (and doesn’t cost a fortune) – tons of open-source packages and much better visualizations.
When I joined DataRobot in 2015, I was introduced to Python and Jupyter Lab. It was immediately obvious to me that Python is much more flexible (and useful) than either R or SAS. Occasionally, when I have a cold, I’ll have a SAS fever dream, but most of my data science work today is in Python.
Data science development is a unique mix of heavy compute, data engineering, data visualization, and a lot more. That means that most of the tools for doing data science have been built for that specific purpose, usually by academics. Unfortunately, every one of the data science development environments that exist – even the newest and most modern – has given the data science community a severe case of Stockholm syndrome. Years of abuse have resulted in data scientists so entrenched in their “preferred” toolkit that change can be unthinkable.
While cloud computing has brought innovations, even these new tools have been built on a flawed foundation. I’d like to point out some of those flaws here. For this post, I’ll be focusing on code-based data science tools. While the promise of GUI-based development has been repeatedly (over)stated, pretty much everyone has learned that these tools lack the breadth, flexibility, transparency, and usability for most real problems. So, for your consideration, here are 4 reasons that data science development tools suck, and a roadmap for how to fix them.
Notebooks are glorified scratch pads
It’s no coincidence that notebooks are called notebooks. Jupyter notebooks were invented at Berkeley (they were called IPython notebooks back in the day) by Fernando Perez. He later said:
I thought maybe I could build a small tool that would make that process of running a bit of code, maybe plotting, visualizing some data, continuing to write code based on what I’m looking at in the figure, to open a data file — that exploratory process — easier.
He, and the many others who have contributed to Jupyter, have done precisely what they set out to do. They built a very slick, sophisticated scratch pad. Want to run a little bit of code to make a scatterplot? Jupyter is amazing at this.
The first time I ever showed a Jupyter notebook to a software engineer, though, they were shocked that it works the way it does. Consider this simplified example: Suppose you have an integer, called a, stored in memory, and you write this line of code in a new cell in a notebook:
a = a + 1
Naturally, when you run this code the value of a will be incremented by 1. Suppose you run that same cell again? It happens again. So the value of a, then, depends on how many times you run that cell. Suppose your kernel crashes, or you restart your computer, or you share your notebook with someone else. Particularly if you used your notebook in an “interactive and exploratory” way (as it was intended to be used!), you are very likely to have inconsistent results between runs. In this simplified example, such a mistake would be easy to track down, but in a long notebook with complex feature engineering and model training, such issues can be much more subtle.
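To make that concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not how Jupyter is implemented: the kernel is stood in for by an ordinary dict, the starting value of 1 is made up, and `kernel_namespace`, `run_cell`, and `increment` are hypothetical names chosen for this example.

```python
# Stand-in for a notebook kernel: one namespace that survives between cell runs.
kernel_namespace = {"a": 1}  # assumed starting value, purely illustrative

# The source of the single cell from the example above.
cell_source = "a = a + 1"


def run_cell(source, namespace):
    """Execute a cell's source against the shared namespace, like pressing Run."""
    exec(source, namespace)
    return namespace["a"]


# Pressing Run three times on the same cell gives three different answers.
values = [run_cell(cell_source, kernel_namespace) for _ in range(3)]
print(values)


# Contrast: a function whose result depends only on its input, not on how
# many times something was run before it.
def increment(a):
    return a + 1


repeatable = [increment(1) for _ in range(3)]
print(repeatable)
```

The first list keeps climbing because the cell mutates state that outlives it. The second stays the same on every call, which is the property you want before anything goes near production.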
This is one of the main reasons that notebooks aren’t suitable for production code. They encourage you to do foolish things and can easily produce inconsistent results depending on how you run them.
Of course, there’s an easy fix. Whenever weird things start happening, just kill your kernel and run the entire notebook from scratch. (There’s even a “kill my kernel” button in most notebooks, if that tells you anything.) Hope it didn’t take too long to train that neural network! If I had a dollar for every time I’ve done this, I’d have a lot more attachments for my tractor, I can tell you that.
We shouldn’t have to work this way. You can have an environment that allows you to explore and interact with your data without sacrificing consistency and safety.
The foundation of notebooks is obsolete
My first job out of school was at Regions Financial. I was helping them build commercial credit scoring models using SAS. The datasets weren’t that big, but the infrastructure was pretty slow, so I found myself doing a lot of waiting. Why weren’t you multitasking, Greg?! Well, because the tools are all single-threaded. Weirdly, the most common tools today still are. First, I tried opening up multiple instances of SAS on my desktop. Not only did that eat up the computer’s resources, but it also introduced some really weird, non-repeatable software bugs that I never did figure out. My solution was to get a second desktop. So I had two keyboards, two mice, two monitors, and two towers. It was comical, to be honest. Incredibly, data science tools still behave this way. It’s absurd.
It’s actually absurd on so many levels. Not only are the tools single-threaded, but an awful lot of data science done today happens locally instead of in the cloud. The last group project that I worked on saw us doing insane things like managing notebook versions on GitHub or zipping and emailing folders or passing work back and forth via Slack. The latest cloud notebooks have at least solved the collaboration problem, but most of them still don’t allow basic simultaneous editing (like Google Docs), and they’re all still single-threaded.
Imagine a world where you could train a model at the same time as you do other exploration or a world where you could train more than one model at a time. Today if you want to do that, you really have three options: First, buy a premade tool; second, give up on interactive mode by using scripting; or third, do multiprocessing through multiple browser tabs. None of these are any way to live.
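For a sense of what the scripting option looks like, here is a rough sketch using Python's standard `concurrent.futures` module. The `train_model` function is a hypothetical stand-in that just sleeps, and the model names and durations are made up. Threads are used to keep the example short; in CPython they only help when the work waits on I/O or runs in native code, so CPU-bound pure-Python training would need processes instead.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def train_model(name, seconds):
    """Hypothetical stand-in for a slow training job: it only sleeps."""
    time.sleep(seconds)
    return f"{name} trained"


with ThreadPoolExecutor(max_workers=2) as pool:
    # Kick off two "training" jobs in the background.
    futures = [
        pool.submit(train_model, "model_a", 0.2),
        pool.submit(train_model, "model_b", 0.2),
    ]

    # The main thread is free for other work while they run.
    exploration = sum(range(10))

    # Block only when the results are actually needed.
    results = [f.result() for f in futures]

print(exploration, results)
```

It works, but notice what was given up: the jobs live inside a script, not in cells you can poke at while they run.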
Data science development is one thing, but at some point, the development comes to an end. Then, the real hijinks begin. After development comes sharing, and the way that’s done today (for good or ill) is through PowerPoint. You can create all the fancy app builders and dashboards, but at the end of the day, corporate communication happens in PowerPoint (or, if you’re truly sophisticated, Google Slides).
After PowerPoint, you may have to contend with IT to actually deploy your work. If you’re very lucky, and your project isn’t too complicated, and your company’s infrastructure is relatively up to date, and InfoSec allows you to do it, then maybe you can get some IT person to start working on recreating your notebook in some sort of stable environment that can actually be reliably deployed.
With the tools that are available today, no matter what you end up doing with your work, you’re almost certainly going to end up copy/pasting it into something else, and that works great as long as you don’t discover any bugs or receive any new ideas or change requests. I suppose it’s possible that one day the end-to-end tools (will exist and) will be able to flexibly handle end-to-end business problems, but they certainly don’t today.
Automation is not for dummies
I think it’s been fairly well established that the Citizen Data Scientist is the second cousin to the Abominable Snowman, having only ever been captured in blurry photographs. Back when we were inventing automated machine learning at DataRobot, we had high hopes that we could usher hordes of business analysts into the world of advanced machine learning. Don’t get me wrong, the software can absolutely do this. The technology is there. It’s the people that aren’t. We spent more time trying to help companies figure out what they should be doing with machine learning than we did actually doing machine learning. It was a very frustrating process that continues to this day in every company trying to sell automated machine learning software.
The fact is that automated machine learning and lots of other automation tools should have been targeted at data scientists. The opportunity to automate big parts of the data science development process still exists, but nobody has yet taken advantage of it. It cannot be accomplished by abstracting away the coding or by spinning up lots of workers to take care of all the technical stuff. Models built this way usually aren’t flexible enough or transparent enough to work for most data science practitioners, and experience has shown me, anyway, that practitioners are the only ones actually doing this stuff.
Conclusion on Improving Data Science Development
Imagine a data science development experience that was truly collaborative where multiple data scientists could work on the same project at the same time. Imagine a world where the data scientist could orchestrate multiple threads to accomplish multiple tasks at the same time. Imagine a world where the development environment actually enabled data scientists to communicate their results with the tools they’re already comfortable using. Much of the innovation in the data science development experience space has been wasted on solving the wrong problems and has been focused on the wrong people. Startups have built bloated, inflexible ecosystems using all the latest buzzwords but still aren’t reaching the only people that are actually doing data science. Zerve.ai solves all these problems. We are revolutionizing the data science developer experience for good. To be a part of our launch, email firstname.lastname@example.org.