For other data scientists to improve, build on, or even just trust your analysis, they need to be able to reproduce it. Even if you have shared code and data, reproducing your analysis may be difficult: which code was executed against which data, and in what order? And even when the steps are clear, rerunning every downstream step after an upstream change, just to see your new results, can be tedious.
This talk will demonstrate the workflow and tools we used to increase our productivity and enjoyment by reducing grunt work and making it easier to build on each other’s work. We used GNU Make as a clear way to represent what each step does, the inputs it depends on, and the output it produces. As we iterate on our analysis, makefiles allow us to conveniently execute only the steps that depend on code or other inputs that have changed since the last run. I’ll walk through an example of creating a project, adding each step as a modular script, and reusing these scripts in different contexts. Since interactive exploration (and debugging) is a big part of data science, I’ll demonstrate techniques for conveniently going back and forth between batch execution via makefiles and working interactively.
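The dependency structure described above might be expressed in a makefile along these lines. This is a minimal sketch, not the exact pipeline from the talk; the script and file names (clean.py, train.py, and so on) are hypothetical placeholders:

```make
# Default target: building the report pulls in the whole pipeline.
all: output/report.html
.PHONY: all

# Each rule states its output, the inputs it depends on (including the
# script itself), and the command that produces it. Make reruns a step
# only when one of its prerequisites is newer than its output.
data/clean.csv: src/clean.py data/raw.csv
	python src/clean.py data/raw.csv $@

output/model.pkl: src/train.py data/clean.csv
	python src/train.py data/clean.csv $@

output/report.html: src/report.py output/model.pkl
	python src/report.py output/model.pkl $@
```

With this in place, editing src/train.py and running `make` rebuilds only the model and the report, leaving the cleaning step untouched.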
After leaving graduate school (where he worked on theorems about machine learning on manifolds), David built loss models for homeowners and auto insurance. David is interested in ways to extract insight from complicated/non-parametric models. At Kaggle, he’s organized a number of predictive modeling competitions and worked with the energy industry to optimize performance from unconventional reservoirs. He’s now working to improve the Kaggle platform.
David is passionate about tools that help data scientists collaborate effectively, iterate quickly, create reproducible analyses, communicate results clearly, and build on each other's work.