Call it the data scientist’s curse, but every practitioner has had a data science project that became unmanageable at some point because of poor organizational choices early on. We’ve all been at our desks at 2 a.m. changing values and re-running our scripts for the 80th time in an hour, asking ourselves where it all went wrong. It’s really easy to dig yourself a hole you can’t climb out of because of bad choices during the planning phase of your project.
It’s All About the Files
As boring as it is, one of the easiest ways to shoot yourself in the foot early on is to ignore your directory structure and just start dumping your data and scripts anywhere they’ll fit. If that’s a bad start, you’ll blow your whole leg off if you start throwing files everywhere while rushing a project out the door.
Start a data science project by setting up your workspace in a way that makes sense for what you’re trying to accomplish, and really think about it. Maybe your top-level directory will include just your Jupyter notebooks, while a separate folder is reserved for your data and another for data dictionaries and notes.
That’s just one way of slicing it. What I do is create a data directory with input, intermediary and output subdirectories. The intermediary directory is just a place to dump files every time I have to write out my data during a different part of a script. This way, if I have to write a file out halfway through the project, I don’t lose track of where it went or confuse it with anything I’m attempting to read in or write out as a final output.
In my input folder, I’ll create separate directories for each of the datasets I include, just in case there are multiple files in a collection. This can get messy if your datasets use long names, but it’s one of the easiest ways to keep everything separated out nicely.
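The layout described above can be created with a few lines of Python. This is only a minimal sketch: the function name `create_workspace` and the dataset names used with it are hypothetical placeholders, and the directory names simply mirror the input, intermediary and output split from the text.

```python
from pathlib import Path


def create_workspace(root, datasets):
    """Create data/input/<dataset>, data/intermediary and data/output under root."""
    data_dir = Path(root) / "data"

    # One input subdirectory per dataset keeps multi-file collections together.
    directories = [data_dir / "input" / name for name in datasets]
    directories.append(data_dir / "intermediary")
    directories.append(data_dir / "output")

    for directory in directories:
        # exist_ok=True makes the function safe to re-run on an existing project.
        directory.mkdir(parents=True, exist_ok=True)
    return directories
```

Calling something like `create_workspace("my_project", ["census", "weather"])` at the start of a project would give every dataset its own folder before any files start arriving.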
Be a Better Steward of Your Environment
Environment variables are a personal favorite way of accessing values that need to be used repeatedly, especially when it’s information you don’t want embedded in a script.
For example, if I have an API key that I’d rather others didn’t have access to, this is one of the best ways to keep it out of sight. I can store the key as a variable in my user-specific environment, so it never appears in the script itself or in anything I share alongside it. It’s also a much more elegant solution than storing the key in a text file that your script reads. I’ve seen the text-file approach in practice, and I don’t recommend it. Using environment variables will make your life much easier in the long run.
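In Python, reading a value from the environment might look something like the sketch below. The variable name `MY_SERVICE_API_KEY` and the helper `get_api_key` are made up for illustration; substitute whatever name you set in your own environment.

```python
import os


def get_api_key(var_name="MY_SERVICE_API_KEY"):
    """Return the API key stored in the environment, or raise a clear error."""
    # os.environ.get returns None when the variable is not set.
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(
            f"Set the {var_name} environment variable before running this script."
        )
    return key
```

Failing loudly when the variable is missing is a deliberate choice here: it beats sending an empty key to an API and debugging a confusing authentication error later.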
I cannot stress this one enough: teach yourself to think in terms of efficient solutions, because it will save your skin when you start working with massive datasets. Brute forcing a problem rarely works in the real world. There have been too many times when I wrote something that should have worked, only to have my laptop begin sweating under the pressure of deeply nested recursion or for loops. That’s when I start sweating too, because I realize I’m not going to have my project in on time.
Don’t do that. Make sure that what you’re doing is as time-efficient and space-efficient as possible. When dealing with large datasets, this will save you from hours of downtime spent twiddling your thumbs while a script runs. Additionally, a script that takes that long to run is often a sign that the code is a rat’s nest, and you’ll be lost when you revisit it months later.
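As one small illustration of the difference, consider finding which IDs in one collection also appear in another. The function names below are hypothetical, and this is just one common case rather than a general recipe. The first version rescans the second list for every item, while the second builds a set once and then does fast lookups.

```python
def shared_ids_slow(left, right):
    """Brute force: scans the whole `right` list for every item in `left`."""
    return [item for item in left if item in right]


def shared_ids_fast(left, right):
    """Builds a set once, so each membership check is constant time on average."""
    right_set = set(right)
    return [item for item in left if item in right_set]
```

Both functions return the same result, but with a few hundred thousand IDs on each side the first does on the order of len(left) * len(right) comparisons, while the second does roughly len(left) + len(right) operations.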
One last tip: break out large computations into separate functions within the script. If I have a long pipeline, I’ll write out the intermediary results of expensive computations to disk, then check in my script whether that file already exists so I don’t need to process it again. The Python pseudocode might look something like this:
    import os

    def phase_one():
        # Skip this phase if its output already exists on disk
        if os.path.isfile('phase_one_output.file'):
            return
        # Code for phase one goes here

    def phase_two():
        if os.path.isfile('phase_two_output.file'):
            return
        # Code for phase two goes here

    def phase_three():
        if os.path.isfile('phase_three_output.file'):
            return
        # Code for phase three goes here

    def main():
        phase_one()
        phase_two()
        phase_three()

    if __name__ == '__main__':
        main()
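The pseudocode above shows the skip-if-exists check but leaves out the step that writes the file. A runnable sketch of the full round trip might look like this. It assumes each phase returns something JSON-serializable, and the helper name `run_cached` is hypothetical; swap in whatever file format your pipeline actually uses.

```python
import json
import os


def run_cached(output_path, compute):
    """Return the saved result if output_path exists; otherwise compute and save it."""
    if os.path.isfile(output_path):
        with open(output_path) as f:
            return json.load(f)

    result = compute()

    # Write to a temporary name first so an interrupted run never leaves
    # a half-written file that later looks like a finished result.
    tmp_path = output_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(result, f)
    os.replace(tmp_path, output_path)
    return result
```

Each phase then becomes a call such as `run_cached("phase_one_output.json", phase_one)`. Deleting an output file is all it takes to force that phase to run again.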
These are just some quick tips for organizing your next data science project to sidestep a lot of the problems that sloppy work will create. Listen to Malone’s talk at ODSC 2018 West for a more in-depth explainer on how to take your data science projects and turn them into useful software for the long run.