Open Source and Data Science, a perfect match
By: Alex Perrier – ODSC data science team contributor
The open source movement has been a force in software development for many years, starting with the A-2 system1 in 1953 and continuing with the Linux kernel, released by Linus Torvalds as freely modifiable source code in 1991. Data science as we know it today, with Jupyter notebooks and freely available top-quality libraries and tools, has directly benefited from the open source way.
The diversity of open source resources for data science can be intimidating. This is why conferences such as ODSC East (Boston, MA, May 20-22), which is focused on Open Data Science, are so important for the practitioner. With an impressive lineup of the most influential actors in open data science, it gives you the chance to discover and learn about new projects and to dive further into projects you are already familiar with.
In fact, the catalog of open data science resources is truly impressive. Start with the scientific libraries: in Python, scikit-learn, NumPy, pandas, SciPy, Astropy, scikit-image, and many others; nearly 7,000 R packages on CRAN; and 700 Julia packages.
In deep learning, we have the recent TensorFlow software from Google (C++), Caffe developed by the Berkeley Vision and Learning Center (C++), Torch, for which Facebook open sourced its deep learning modules in 2015 (Lua), Theano from the University of Montreal (Python), and the list goes on and on.2
Data visualization also has its own trove of open source libraries, such as ggplot2, matplotlib, Bokeh, knitr, D3, plot.ly, and more, while big data has Hadoop and Apache Spark. Tools like Vowpal Wabbit and algorithms like XGBoost are popular in Kaggle competitions. And beyond machine learning libraries, today's data scientists can base their work on cutting-edge data science platforms: the Anaconda suite from Continuum, used by a large number of Python users; H2O, an open source algorithm development platform; RapidMiner, a predictive analytics platform; and notebooks such as Jupyter, Beaker, and Zeppelin, which help foster sharing and collaboration among teams.
Moving beyond software, data itself has also been open sourced through the creation of open data platforms: Kaggle datasets, Amazon datasets, Google BigQuery, and of course the classic UCI Machine Learning Repository. The aim is to promote research, collaboration, and open data science by making these important datasets freely accessible to the public.
Finally, all these scripts and libraries wouldn't be as accessible without Git and GitHub, the version control system and hosting platform at the core of open source software. GitHub offers standardized features that constrain and shape open source development and foster feedback and collaboration.
These examples are only the tip of the iceberg of the Open Data Science ecosystem and are by no means meant to be exhaustive. There are many more machine learning frameworks, libraries, and software packages in languages besides Python and R: Julia, C, C++, Haskell, Java, Go, .NET.
Free and Open Source Software (FOSS) has been known for years to bring quality, security, flexibility, and cost benefits to companies, users, and developers. But the marriage between open source and data science goes even further.
It is a known fact that FOSS allows developers to contribute in myriad ways to their favorite project: fixing bugs, running tests, adding comments, improving performance, and extending compatibility. With data science tools, and particularly scientific libraries, data scientists also contribute by suggesting new algorithms and techniques directly inspired by the most recent and innovative research.
These suggestions follow a peer-review process that is very similar to peer review for research publications. Not only can people help improve the existing code base; they can also participate in its roadmap and influence the future evolution of the library. Although each open source project has its own set of rules, policies, and quality guidelines, most offer ways for people outside the core team to start discussions, offer ideas, and make proposals via mailing lists or the GitHub issue tracker. It's not democracy, and it's not the wisdom of crowds; it's closer to a benevolent dictatorship or rule by a council of elders.
How are the decisions made, for instance, to include a new algorithm in a library such as scikit-learn? The decision process is in the open and available through the mailing lists:
- Someone suggests a new algorithm and makes a case for it (papers, benchmarks, …)
- A discussion takes place
- The core team makes a final decision
- A developer clones the repo, submits the new code for review, and makes sure it follows scikit-learn's guidelines (tests, comments, coding style, coherence); the code is then reviewed and improved until it is accepted.
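For readers new to this workflow, the last step can be sketched with standard Git commands (the username placeholder and branch name below are illustrative, not part of scikit-learn's official guide):

```shell
# Fork scikit-learn on GitHub first, then work from your fork:
git clone https://github.com/<your-username>/scikit-learn.git
cd scikit-learn
git checkout -b add-new-algorithm   # illustrative branch name

# ... implement the estimator, its tests, and narrative documentation ...

git add .
git commit -m "ENH: add new estimator with tests and docs"
git push origin add-new-algorithm
# Finally, open a pull request against scikit-learn/scikit-learn,
# where the core team reviews and iterates on the code.
```

The pull request then becomes the venue for the review-and-improve loop described above.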
Ben Lorica, Chief Data Scientist of O'Reilly Media, sums it up well: "Contributions to scikit-learn are required to include narrative examples along with sample scripts that run on small data sets. Besides good documentation there are other core tenets that guide the community's overall commitment to quality and usability: the global API is safeguarded, all public API's are well documented, and when appropriate contributors are encouraged to expand the coverage of unit tests."3
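To give a flavor of the kind of small, self-contained sample script Lorica describes, here is a minimal sketch; the choice of dataset and estimator is mine, not taken from scikit-learn's contribution guide:

```python
# A small, reproducible script in the spirit of the narrative
# examples that scikit-learn contributors include: load a tiny
# built-in dataset, fit an estimator, and report a metric.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# The iris dataset has only 150 samples, so the whole script runs in seconds.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit a simple classifier and measure held-out accuracy.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Scripts like this double as documentation and as smoke tests: a reviewer can run them end to end in seconds to verify that the public API behaves as advertised.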
This crowd-based suggestion process shortens the time it takes to implement new techniques. The peer review process ensures that the improvements are real, fit the overall vision of the tool, and are scientifically sound.
Open sourcing is not just about the code or the license; it's about opening up the conversations, the decision process, and the project's evolution. Peer review brings forth the science in data science and speeds the implementation of crowdsourced innovations.
Why open source?
Some open source projects are community driven (scikit-learn, for instance) and backed by institutions, while others are driven by companies. Some projects are open source from the beginning, while others are kept proprietary for a while and open sourced later in order to maximize financial support.
In recent months we've seen many such libraries being open sourced, from big actors like Google open sourcing TensorFlow to smaller companies such as Plotly. Why is that?
The benefit of open sourcing is well illustrated in Plotly's announcement:4 "We're big fans of collaboration, freedom, and perpetual motion. Open-source has become the de facto distribution for gold-standard scientific … software. […] By open-sourcing Plotly's core technology, everyone benefits from peer-review and Plotly's products will continue to be the most cutting-edge offering for exploratory visualization. Plotly.js has the quality, accessibility, and scope to be the charting standard for the Web, […]" (emphasis mine).
In short, open sourcing the code allows the product to become a standard at the forefront of innovation. Nice!
For a domain such as data science that thrives on collaboration, open source does not stop with tools, packages, and libraries. Open data science also benefits from open source books, MOOCs, machine learning courses, videos, blogs, and social platforms. And the discussions and collaborations continue at meetups and conferences.
Offline interactions, face-to-face and group discussions are necessary for knowledge to spread, ideas to sprout and collaborations to flourish.
This is where conferences like ODSC, SciPy, R world, Hadoop, NIPS, and many others come in.
ODSC East showcases the best and the brightest when it comes to data science, including many open source data science pioneers:
- A scikit-learn workshop with Andreas Müller
- Kirk Borne, one of the most influential data scientists, on Open Data for Social Good
- Stefan Karpinski, co-creator of the Julia programming language
- Ingo Mierswa, founder of RapidMiner, delivering the keynote
- Peter Wang, CTO and co-founder at Continuum, on visualization of large data with Bokeh
- Eric Novick, Founder & CEO of Stan Group Inc, on Probabilistic Programming with Stan
- Max Kuhn, Director of Nonclinical Statistics in Pfizer R&D and R expert, on Rule-Based Models
- Allen Downey, author of many open source data science books, on Bayesian Statistics
- JJ Allaire, Founder and CEO at RStudio, on R Markdown
- Kaz Sato on TensorFlow
and many more. The list of speakers is truly amazing. ODSC East in May 2016 is a fantastic opportunity to learn more about the open source tools that fuel the data science revolution. Don't miss it.
To read more from Alex, sign up for our newsletter or follow him on Twitter at @alexip.