Standard software development practices for web, SaaS, and industrial environments tend to focus on maintainability, code quality, robustness, and performance.
Scientific programming in data science is more concerned with exploration, experimentation, making demos, collaborating, and sharing results.
It is this very need for experiments, explorations, and collaborations that is addressed by notebooks for scientific computing. Notebooks are collaborative web-based environments for data exploration and visualization — the perfect toolbox for data science.
In your favorite browser, you can run code, create figures, explain your thought process and publish your results. Notebooks help create reproducible, shareable, collaborative computational narratives.
Fast forward 15 years: IPython was just a toddler of a few hundred lines of code, when SageMath became available as a free and open source environment for scientific computing.
The past few years have seen the rise of IPython and its evolution into the Jupyter Project, as well as the emergence of new notebooks, Beaker and Zeppelin. In this article we look at what distinguishes these notebooks and how mature they are.
SageMath: The First Open Source Notebook
The Sage Notebook was released on 24 February 2005 by William Stein, now a professor of mathematics at the University of Washington, as free and open source software (GNU License), with the initial goal of creating an “open source alternative to Magma, Maple, Mathematica, and MATLAB.”
Sage is based on Python and Cython and focuses on mathematical worksheets. The Sage Notebook recently moved to the cloud with SageMathCloud, thanks to a collaboration with Google’s cloud services. Sage Notebooks can also be downloaded and run on your local machine.
Although not as popular as it ought to be, the Sage Notebook is a free and entirely open source alternative to Mathematica and MATLAB. It supports Python, LaTeX, Markdown, task lists, R, and IPython Notebooks, and lets you manage courses, write C programs, set up chatrooms, and create Sage worksheets for sophisticated mathematics.
The Sage Cloud Notebook can be seen as a hosted version of a platform similar to MATLAB or Mathematica that can also embed IPython notebooks. Its math scripting language is as good as its competitors', if not better in some cases. And the Sage community is large, deep, and wide, with its own Stack Overflow-style website.
The Sage Cloud Notebook works like a charm right out of the box. There’s nothing to install and performance is decent. To test it, I simply copied a couple of scikit-learn tutorials into a Sage Notebook, had no issues, got the results as fast as expected, and was able to publish the notebook right away.
Sage Cloud is freemium-based. With a free plan of 3 GB of storage and 1 GB of memory, and a paid plan of 8 GB / 50 GB for $7/month, it offers a pretty competitive deal.
It’s a stable environment with few bugs and many online resources and advanced math tutorials.
However, with only 20 recent contributors and over 80% of the commits done by William Stein himself, the Sage Notebook could benefit from a larger community of contributors. A bit of visual revamping and better marketing would also go a long way to boost its adoption rates.
Jupyter Notebooks, formerly known as IPython Notebooks, have enjoyed impressive success and steady growth since 2011. In the past year, the number of ipynb files on GitHub has nearly tripled, from 80,000 to over 230,000 files.
Following the publication in Nature in November 2014 of an article on IPython, “Interactive notebooks: Sharing the code,” Rackspace, the company hosting IPython Notebooks, had to ramp up and serve more than 20,000 IPython notebooks.
The IPython console was started by Fernando Perez circa 2001. From a first attempt to replicate a Mathematica Notebook with 259 lines of code to the first presentation of the IPython Notebook at the EuroSciPy 2011 conference in Paris, IPython had multiple false starts, diverse external influences, and time to mature.
With the Sage Notebook being a reference all along, Fernando Perez had many collaborations with the Sage team. A detailed history of IPython can be found on Fernando Perez's blog.
Comparing Google trends for SageMath (red), IPython (Green) and Apache Zeppelin (blue). Note the hockey stick for IPython after the EuroSciPy 2011 conference.
In 2015, the IPython Notebook project became the Jupyter project, an ambitious project whose goal is to lay the foundation for a generation of scientific publications focused on reproducibility by making the data and the code accessible and open.
The project vision is presented in Project Jupyter: Computational Narratives as the Engine of Collaborative Data Science. See also Brian Granger’s keynote address at ODSC West. As Safia Abdalla, writer for Opensource.com, puts it: “The Jupyter Notebook hints at what the academic journals of tomorrow will look like and paints a promising picture. They will be interactive, visualization-focused, user-friendly, and include code and data as first-class citizens.”
The ability to go beyond Python and run several languages in a notebook is also at the center of the Jupyter rebirth. Multilingualism is still limited to one language per notebook: it is not possible to mix cells in different languages within the same notebook. Furthermore, in order to run notebooks in languages other than Python, you still need to install additional kernels. See this article for a detailed walkthrough of Implementing an R kernel.
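The one-language-per-notebook limit is visible in the notebook file itself. Here is a minimal sketch, assuming the nbformat 4 JSON layout, in which the kernel is declared once in the notebook-level metadata. The notebook content and the two helper functions are made up for illustration and are not part of any Jupyter API.

```python
import json

# A minimal, hand-written notebook document (hypothetical content),
# following the nbformat 4 layout: the kernel is declared once, in metadata.
NOTEBOOK_JSON = """
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "kernelspec": {"name": "ir", "display_name": "R", "language": "R"}
  },
  "cells": [
    {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
    {"cell_type": "code", "metadata": {}, "source": ["summary(cars)"],
     "outputs": [], "execution_count": null}
  ]
}
"""


def notebook_language(raw: str) -> str:
    """Return the language declared by the notebook's single kernelspec."""
    nb = json.loads(raw)
    return nb["metadata"]["kernelspec"]["language"]


def code_cell_count(raw: str) -> int:
    """Count code cells; all of them are sent to that one kernel."""
    nb = json.loads(raw)
    return sum(1 for cell in nb["cells"] if cell["cell_type"] == "code")


print(notebook_language(NOTEBOOK_JSON), code_cell_count(NOTEBOOK_JSON))
```

Because the kernel is a property of the whole document rather than of each cell, switching languages means switching kernels, which is why a non-Python kernel has to be installed first.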
The Jupyter Project benefits from a large community of contributors, partnerships with many companies (Rackspace, Microsoft, Continuum Analytics, Google, Github, …) and universities (UC Berkeley, George Washington University, NYU, …)
There’s a New Notebook in Town
In fact not one but two new notebooks have recently blipped on the data scientist radar: the Apache Zeppelin Notebook and the Beaker Notebook.
The Zeppelin Notebook
The Zeppelin Notebook is supported and incubated by the Apache software foundation with Lee Moon Soo as its lead developer. It is similar in concept to the Jupyter Notebook with several noticeable differences.
Apache Zeppelin is built on the JVM while Jupyter is built on Python. This can be an obstacle for someone not familiar with the Java ecosystem (my case, in fact). Since the project is in its infancy, it does not yet offer a binary install file. You have to clone the GitHub repository and build it from source, with dependencies such as OpenJDK, Maven, and Node. The main problem people encounter during the install seems to be with the web front-end installation.
Zeppelin offers the possibility to mix languages across cells. Zeppelin currently supports Scala (with Apache Spark), Python (with Apache Spark), SparkSQL, Hive, Markdown and Shell. And you can make your own language interpreter.
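In Zeppelin, each paragraph selects its interpreter with a leading `%name` directive such as `%md` or `%pyspark`. The routing idea can be sketched in plain Python. This illustrates the concept and is not Zeppelin's code: the function name, the default interpreter, and the sample paragraphs are all hypothetical.

```python
from typing import Tuple

# Assumption: stands in for whatever default the notebook is configured with.
DEFAULT_INTERPRETER = "spark"


def split_paragraph(text: str, default: str = DEFAULT_INTERPRETER) -> Tuple[str, str]:
    """Split a paragraph into (interpreter name, body).

    A paragraph starting with '%name' is routed to that interpreter;
    anything else goes to the default.
    """
    stripped = text.lstrip()
    if stripped.startswith("%"):
        first_line, _, rest = stripped.partition("\n")
        parts = first_line[1:].split(None, 1)
        name = parts[0] if parts else default
        # Code may follow the directive on the same line or on later lines.
        inline = parts[1] if len(parts) > 1 else ""
        body = "\n".join(p for p in (inline, rest) if p)
        return name, body
    return default, stripped


paragraphs = [
    "%md\n## Bank data",
    "%pyspark\nprint(sc.version)",
    "%sql select age, count(1) from bank group by age",
    "val n = 42",
]

for p in paragraphs:
    print(split_paragraph(p))
```

Since the language is chosen per paragraph rather than per notebook, a single note can go from Markdown to Python to SQL without switching documents.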
Apache means Apache Spark. Zeppelin is fully oriented toward Spark. It is a data exploration and visualization tool intended for big data and large-scale projects. Of course you can use pyspark in a Jupyter Notebook, but Zeppelin is natively Spark. Being part of the Apache ecosystem does not hurt either.
The content of one cell can be modified downstream by another cell. Fun! See a simple demo by Lee Moon Soo of Zeppelin's Angular display system. Zeppelin runs a Scala interpreter (REPL), which gives you direct access to the DOM.
It also has an Angular interpreter, which allows you to build great UIs and import whole chunks of web technologies.
This Zeppelin Notebook offers a good sample of Visualizations in Zeppelin. Although still pretty much in beta, the Zeppelin Notebook enjoys quite a buzz and has a rather large community.
The roadmap for version 0.6 includes R support with the implementation of the SparkR interpreter, better Python and Angular REPLs, and job management, among other things.
- A Zeppelin overview
- Another great notebook example: Analyzing network intrusion dataset with Python and Spark
- Data Science Lifecycle with Zeppelin and Spark
The Beaker Notebook
See for instance this translation notebook example where different types of variables (scalar, dataframes, images, tables, etc.) are set and accessed in Python, R, JS, and Groovy.
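One way to picture how a variable set in one language can be read in another is a shared store that holds values in a neutral serialized form. The sketch below uses JSON for that purpose. It is a conceptual illustration only, not Beaker's implementation or API: `SharedNamespace` and its methods are hypothetical names, and both "cells" are written in Python for simplicity.

```python
import json
from typing import Any, Dict


class SharedNamespace:
    """Toy notebook-wide store: values are kept as JSON text so that a
    reader written in any language could decode them."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def set(self, name: str, value: Any) -> None:
        # json.dumps raises TypeError for values with no JSON form.
        self._store[name] = json.dumps(value)

    def get(self, name: str) -> Any:
        return json.loads(self._store[name])


shared = SharedNamespace()

# "First cell": publish a scalar and a small table (a list of row dicts).
shared.set("threshold", 0.5)
shared.set("table", [{"id": 1, "score": 0.9}, {"id": 2, "score": 0.2}])

# "Second cell": read both values back and filter the table.
rows = [r for r in shared.get("table") if r["score"] > shared.get("threshold")]
print(rows)
```

The trade-off of any such scheme is that every value crossing a language boundary has to be converted, which matters more as the data grows.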
The Beaker Notebook app runs on Mac, Linux, and Windows, and the install is straightforward. It comes bundled with the usual suspects: Markdown, LaTeX, and Python. Other languages must be hooked up to be recognized by the notebook, which can be challenging and time-consuming.
Installing a piece of software can always be difficult. This might also be true of Zeppelin or Jupyter Notebooks. What was troubling with the Beaker Notebook was the absence of a vibrant online community. Issues and problems are mostly dealt with through the GitHub repository's issue tracker.
And with a mere 90 questions tagged ‘beaker’ on Stack Overflow and no mailing list for users or developers, troubleshooting your install can be a bit problematic.
There are not many Beaker Notebooks available yet besides the ones on the beaker publication server. The Beaker Notebook is a great concept that needs a bit of traction and love from the community to take off. It will be interesting to see how more real-world projects and datasets will fare on this platform.
Will the language adaptors start hindering the notebook's performance on larger datasets?
- Better interactive data science with Beaker and Rodeo
- Minegraph, a Beaker notebook with Python & d3.js tools for analyzing and visualizing Minecraft worlds.
You can use IPython with pyspark or use Python in a Zeppelin Notebook. You can mix Python, Ruby, and R in Beaker, or do some complex math in Sage Notebooks and then import a Jupyter notebook.
The possibilities are endless and the momentum is here for notebooks to evolve towards language agnosticism, bigger distributed projects, and better visualization flows. Jupyter is an amazing project that feeds and rides the rising wave of data science.
And although it’s hugely popular, there is still room for other amazing projects that will in turn inspire Jupyter.
As with any aspect of data science and machine learning, it’s an exciting time to be able to witness these fantastic scientific computing projects come alive.
Lead Data Scientist focused on Natural Language Processing and Predictive Modeling, a background in stochastic processes and signal processing and extensive experience in agile software development. I recently authored a book on AWS Machine Learning with Packt Pub. I am a creative start-up co-founder with clear communication skills, project management and business development experience. Team lead and team builder.