On the Code of Data Science


The recent ODSC UK conference welcomed Gael Varoquaux, a Computer Science Researcher and key contributor to scikit-learn, to give a keynote address. His speech had a distinctly Pythonic flavor, just as he had warned it would.

Mr. Varoquaux began by promoting the integration of common software development methodologies and best practices into the Data Science workflow. He championed tools such as a debugger, a code editor with linting, and version control, along with caching, such as that provided by the joblib library, to minimize compute time. At the other end of the spectrum, however, he warned against premature over-engineering.
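
To make the caching point concrete, here is a minimal sketch of how joblib's Memory class can store the result of a slow function on disk, so that a repeated call with the same arguments is read back instead of recomputed. This is not code from the keynote; the function name slow_square, the call counter, and the temporary cache directory are assumptions made up for illustration.

```python
import tempfile

from joblib import Memory

# Hypothetical cache location: a throwaway temp directory.
# A real project would normally point this at a stable path.
cache_dir = tempfile.mkdtemp()
memory = Memory(location=cache_dir, verbose=0)

# Counter used only to show how often the function body really runs.
call_count = {"n": 0}


@memory.cache
def slow_square(x):
    """Stand-in for an expensive computation (hypothetical example)."""
    call_count["n"] += 1
    return x * x


first = slow_square(12)   # computed, then written to the cache
second = slow_square(12)  # same argument: loaded from the cache
third = slow_square(5)    # new argument: computed again
```

After these three calls the function body has run only twice, because the second call was served from disk. The same decorator works on functions that take NumPy arrays, which is where the time savings tend to matter most.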

The topic then switched to Machine Learning from the perspective of the newest version of scikit-learn, 0.18. Most of this section covered how to leverage the library to work with big data. This included online learning with the partial_fit method, data reduction with Feature Agglomeration and other algorithms, and the outlier detection algorithms the library makes available.
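
As an illustration of the online learning idea, the sketch below feeds a dataset to a model in small batches through partial_fit, so the full dataset never has to be passed to the estimator at once. It is not taken from the keynote: the synthetic data, the batch size of 100, and the choice of SGDClassifier are all assumptions picked to keep the example short.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Synthetic data (an assumption for illustration): the label depends
# on the sum of the first two features.
rng = np.random.RandomState(0)
X = rng.randn(1000, 5)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = SGDClassifier(random_state=0)

# partial_fit needs the full set of class labels on its first call,
# because a single batch may not contain every class.
classes = np.unique(y)

batch_size = 100  # hypothetical batch size
for start in range(0, len(X), batch_size):
    stop = start + batch_size
    clf.partial_fit(X[start:stop], y[start:stop], classes=classes)

# Evaluated on the training data only, to confirm the model learned.
accuracy = clf.score(X, y)
```

In a real big-data setting each batch would be read from disk or a stream rather than sliced from an in-memory array, but the loop around partial_fit would look much the same.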

The remarkably informative keynote ended with a preview of the cool additions that will be in the next version of scikit-learn. The best way to sit out the wait is to take the wealth of information provided by Mr. Varoquaux and use it to beef up your own Data Science and Machine Learning workflows.