The content in this post was presented as a talk at ODSC East 2020. The slides from that presentation are available here.
Data science is a relatively new field and practicing data scientists come from a wide variety of backgrounds. This is a great thing for the field because the problems being solved with data science are diverse. However, a lot of data science education focuses on teaching algorithms and techniques for small projects that are developed over hours or weeks, and are much smaller in scope than projects developed in industry.
Projects in industry are highly collaborative and need to stand the test of time. A number of situations can make work on a project difficult, such as:
- Your coworker owns a project but is on vacation for a few weeks and you need to run their code for the first time
- You’re hired as a company’s first full-time data scientist and inherit a bunch of work from contractors that are no longer available
- You revisit a project that you developed a year ago and need to rediscover what you did and why
Most technology workers are familiar with the concept of “but it works on my machine,” but it’s a phrase that we definitely want to avoid. By learning some best practices from software engineering, data scientists can write code that is reproducible and understandable. In this post, we’ll briefly describe the concepts of dependency management, version control, and coding standards and how they’ll make your job a lot easier.
Dependency management comes from the idea that it should be easy to install everything you need to run multiple projects on any computer. This is valuable to:
- Work on multiple projects at the same time with different requirements
- Avoid modifying your global Python environment that system processes may depend on
- Make it easy for a new collaborator to start working on your project
- Enable you to fix your installation when you accidentally run a command that breaks it
A dependency is software that is published so that you can use it in your data project. Dependencies that you use may have their own dependencies, so it’s easy for your project to have hundreds of dependencies. For example, in the Python data science world, some common data science dependencies are the libraries NumPy and SciPy.
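To make this concrete, here is a minimal sketch that inspects the packages installed in the current Python environment using the standard library's `importlib.metadata`. The helper names `pinned_requirements` and `direct_requirements` are made up for illustration, and the output depends entirely on what happens to be installed on your machine.

```python
from importlib import metadata


def pinned_requirements() -> list[str]:
    """Return sorted 'name==version' lines for every installed distribution."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        version = dist.metadata.get("Version")
        if name and version:  # skip distributions with incomplete metadata
            lines.add(f"{name}=={version}")
    return sorted(lines, key=str.lower)


def direct_requirements(package: str) -> list[str]:
    """Return the requirement strings a package declares for itself.

    An empty list means the package declares none or is not installed.
    """
    try:
        return metadata.requires(package) or []
    except metadata.PackageNotFoundError:
        return []


if __name__ == "__main__":
    installed = pinned_requirements()
    print(f"{len(installed)} distributions installed")
    # Each of those distributions may declare further dependencies of its own.
    print(direct_requirements("pip"))
```

Running this in a fresh environment and again in one where a few data science libraries have been installed is a quick way to see how fast the list grows.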
There are many tools that will help you manage your dependencies. Speaking generally, these work by providing a way for you to specify your dependencies and then installing the dependencies into an isolated execution environment for each of your projects. Explicitly specifying dependencies and keeping project dependencies isolated makes your code a lot more reproducible!
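As a rough illustration of the isolation idea, the sketch below uses Python's built-in `venv` module to create an environment inside a project folder and checks whether the running interpreter is itself inside one. The function names and the `.venv` folder name are assumptions chosen for this example, and other dependency managers do this step differently.

```python
import sys
import tempfile
import venv
from pathlib import Path


def in_virtual_environment() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def create_project_environment(project_dir: Path) -> Path:
    """Create an isolated environment in <project_dir>/.venv and return its path."""
    env_dir = project_dir / ".venv"
    # with_pip=False keeps this sketch fast and offline; a real project
    # would normally use with_pip=True so that packages can be installed.
    venv.create(env_dir, with_pip=False)
    return env_dir


# A temporary folder stands in for a real project directory here.
with tempfile.TemporaryDirectory() as tmp:
    env_dir = create_project_environment(Path(tmp))
    config_exists = (env_dir / "pyvenv.cfg").is_file()
    print("environment created:", config_exists)
    print("currently inside a virtual environment:", in_virtual_environment())
```

In day-to-day work you would more likely run the equivalent command-line step once per project (for example `python -m venv .venv`) and then install the project's declared dependencies into it.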
In the Python ecosystem, some of the most popular tools are virtualenv, venv, and pipenv for isolating project environments, along with pyenv for managing the Python versions themselves. In the R ecosystem, there’s Packrat and renv. There are even some language-agnostic tools such as conda. While every dependency manager has pros and cons, your setup can get complicated if you attempt to use more than one. The important thing is to choose a dependency manager that you like, learn how it works, and then use it consistently every time that you start a new project.
Version control is used to track changes to a set of files over time. The most popular version control system is Git, so it’s highly recommended that you get familiar with it. While Git itself is universal, there are many Git hosting services to choose from when you decide to host your code online. The most popular are GitHub, GitLab, and Bitbucket.
Many people are introduced to version control for the first time when they start working with a team or using open-source software. This can lead to the misconception that version control is only for working with others. But I highly recommend that you always use version control, even if you know you’ll be working on a project alone. The reason for this is that version control can provide context to your code, making it more understandable.
When you make a change to a set of files in Git it’s called a commit. Your commits include commit messages and these are your opportunities to provide context to the changes you’re making. Using commands such as `git log` and `git blame` you should be able to answer questions such as “When was a change made?”, “Why was a change made?”, and “Is there additional context I can look up about this change?” (usually, additional context comes via a link to an issue tracker). There are many great guides to writing good Git commit messages, such as this one by Chris Beams.
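One way to keep commit messages useful is to check them against a few simple conventions. The sketch below assumes a team has adopted some commonly cited rules (a capitalized subject line of at most 50 characters with no trailing period, a blank line before the body, and body lines of at most 72 characters). The function name, the limits, and the sample messages are illustrative assumptions, not something Git enforces.

```python
def check_commit_message(message: str) -> list[str]:
    """Return a list of style problems found in a commit message."""
    problems = []
    lines = message.strip("\n").split("\n")
    subject = lines[0]

    if not subject.strip():
        problems.append("subject line is empty")
        return problems
    if len(subject) > 50:
        problems.append("subject line is longer than 50 characters")
    if not subject[0].isupper():
        problems.append("subject line is not capitalized")
    if subject.endswith("."):
        problems.append("subject line ends with a period")

    # Anything after the subject must be separated from it by a blank line.
    if len(lines) > 1 and lines[1].strip():
        problems.append("no blank line between subject and body")

    for number, line in enumerate(lines[2:], start=3):
        if len(line) > 72:
            problems.append(f"line {number} is longer than 72 characters")
    return problems


# Hypothetical messages, used only to exercise the checks above.
good = (
    "Pin NumPy version in requirements\n"
    "\n"
    "The nightly job broke after an unpinned upgrade. Pinning the\n"
    "version keeps the environment reproducible."
)
bad = "fixed stuff."

print(check_commit_message(good))
print(check_commit_message(bad))
```

A check like this only covers form. Whether the message explains why the change was made is still up to the author.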
Coding standards are rules that you establish with your team about how your code is structured and which processes you need to follow. Setting standards will make it easy for you to jump into code or projects that were created by other members of your team. They should also encourage best practices and catch bugs. At a minimum, I would create standards around:
- Structuring projects (I’m personally a big fan of DrivenData’s Cookiecutter Data Science template)
- Using linters to automatically check your code
- Writing Git commits
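To give a feel for what a linter does, here is a toy check built on Python's standard `ast` module that flags functions and classes without a docstring. It is only a sketch of the idea: real linters cover far more rules, and the function name and sample source below are invented for this example.

```python
import ast


def find_missing_docstrings(source: str) -> list[str]:
    """Report every function or class in the source that lacks a docstring."""
    tree = ast.parse(source)
    missing = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            if ast.get_docstring(node) is None:
                missing.append(f"line {node.lineno}: {node.name} has no docstring")
    return missing


# Hypothetical project code; the string is parsed, never executed.
SAMPLE = '''
def load_data(path):
    """Read the raw data file."""
    return open(path).read()

def clean(rows):
    return [r for r in rows if r]
'''

for warning in find_missing_docstrings(SAMPLE):
    print(warning)
```

The value of a linter comes from running it automatically, so that the team's standards are checked the same way on every change instead of relying on memory during review.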
I believe that the three software engineering best practices I’ve presented should be adopted by all data scientists. Use a dependency manager so that your environment is reproducible, version control so that your project’s history is traceable, and coding standards to make it easy to navigate your projects. Manage your data projects like a software engineer and start reaping the benefits!
About the author:
Michael Jalkio is a data engineer at Amazon in San Diego. He works in the Buyer Risk Prevention team, whose mission is to keep Amazon stores safe and trustworthy by protecting customer accounts from takeover, fraud, and abuse. Before joining a “big tech” company he worked on an enterprise data warehouse migration project at Petco, and helped build the data science team at a startup called Classy. He’s most passionate about doing work that makes a positive impact and helps give everyone in the world equal opportunity to do what they love.