Editor’s note: Francesco is a speaker for ODSC West 2021. Be sure to check out his talk, “Reproducibility and Dependencies for Jupyter Notebooks,” there!
Many developers (including data scientists) focus on their core problems when working on their experiments, yet one basic aspect, which comes up before anything machine learning-related is even considered, can make these projects impossible to reuse.
One of the first steps in developing a project is selecting its libraries, or dependencies. When you run pip install <package-name>, you might not be aware that, along with the library you asked for (the so-called direct dependency), many other packages are installed on your machine: the so-called transitive dependencies. Any change in one of those dependencies can break your experiment. It is fundamental to have a way to state everything an experiment depends on, including the operating system, the Python interpreter, and the hardware used to run it.
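To make the difference between direct and transitive dependencies concrete, here is a minimal sketch that walks the requirements declared by an installed package using only the standard library. The function names are illustrative, the parsing of requirement strings is deliberately simplistic (optional extras are skipped and names are only lowercased), and this is not how pip or Thoth resolve dependencies.

```python
import re
from importlib import metadata


def direct_requirements(package):
    """Return the names of the packages that `package` declares as requirements."""
    try:
        declared = metadata.requires(package) or []
    except metadata.PackageNotFoundError:
        # The package is not installed in this environment.
        return set()
    names = set()
    for requirement in declared:
        # Skip optional extras such as 'pytest; extra == "test"'.
        if "extra ==" in requirement:
            continue
        # Keep only the leading project name, dropping versions and markers.
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", requirement)
        if match:
            names.add(match.group(0).lower())
    return names


def transitive_dependencies(package, lookup=direct_requirements):
    """Collect every package reachable from `package`, excluding itself."""
    seen = set()
    stack = [package]
    while stack:
        current = stack.pop()
        for name in lookup(current):
            if name not in seen and name != package:
                seen.add(name)
                stack.append(name)
    return seen


# Everything that came along with one direct dependency in this environment.
print(sorted(transitive_dependencies("jupyterlab")))
```

If the package is not installed, the result is simply an empty set. The `lookup` parameter is there so the traversal can be tried on a hand-written dependency graph.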
Jupyter notebooks are not stand-alone by default: you need to provide the environment and the packages they run in, through a requirements.txt, a Pipfile.lock, or another manifest file. Anyone who receives these notebooks has to set up the environment again using those manifests.
Dependency management is one of the most important requirements for reproducibility. Having dependencies clearly stated allows portability of notebooks, so they can be shared safely with others, reused in other projects, or simply reproduced.
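As a rough illustration of what "clearly stated" can mean in practice, the sketch below records the interpreter, operating system, machine architecture, and installed package versions of the current environment. The function name and the dictionary layout are assumptions made for this example, and the result is a plain snapshot, not a lock file with hashes.

```python
import json
import platform
from importlib import metadata


def environment_snapshot():
    """Describe the interpreter, OS, hardware, and installed packages."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:
            packages[name] = dist.version
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "operating_system": platform.platform(),
        "machine": platform.machine(),
        "packages": dict(sorted(packages.items())),
    }


# Store this next to an experiment to document where it was run.
snapshot = environment_snapshot()
print(json.dumps({k: v for k, v in snapshot.items() if k != "packages"}, indent=2))
print(len(snapshot["packages"]), "packages installed")
```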
Project Thoth helps developers keep dependencies up to date by giving recommendations through the tools they use every day. Thanks to this service, developers (including data scientists) do not have to worry about managing the dependencies after they are selected, since conflicts can be handled by Thoth services. Having this AI support can bring benefits to all data science projects, offering improvements such as increased performance due to optimized dependencies and additional security, since it helps keep insecure libraries out.
Among the different Thoth integrations, there is a JupyterLab extension for dependency management called jupyterlab-requirements.
You can use this extension for each of your notebooks to guarantee they have the correct dependencies. The extension can add and remove dependencies, lock them, and store them in the notebook metadata. In this way, all the dependency information required to recreate the environment is shipped with the Jupyter notebook.
The Jupyter notebook is now stand-alone, and you can share it safely with others without providing additional files. The risk of someone updating a package in those manifests is no longer an issue. Whoever receives the notebook can just run a command with this extension, and the environment will be ready for them to perform the experiment.
In particular, the following notebook metadata are created for you when you use Thoth’s dependency management tool:
- Requirements (Pipfile);
- Requirements locked with all versions and hashes of libraries (direct and transitive ones) (Pipfile.lock);
- Dependency resolution engine used (Thoth or Pipenv);
- Configuration file for dependency resolver (only for Thoth resolution engine).
All this information makes the notebook reproducible and shareable.
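Because a notebook file is plain JSON, the entries listed above can be inspected without any special tooling. The sketch below is only an illustration: the metadata key names in EXPECTED_KEYS are assumptions based on the list above and should be checked against a notebook actually written by the extension, and the helper names are hypothetical.

```python
import json

# Assumed key names; verify them against a notebook saved by the extension.
EXPECTED_KEYS = (
    "requirements",
    "requirements_lock",
    "dependency_resolution_engine",
    "thoth_config",
)


def dependency_metadata(notebook_json):
    """Return the dependency-related entries found in a notebook's metadata."""
    notebook = json.loads(notebook_json)
    notebook_metadata = notebook.get("metadata", {})
    return {key: notebook_metadata[key] for key in EXPECTED_KEYS if key in notebook_metadata}


def is_self_contained(notebook_json):
    """True when both the requirements and their lock are embedded."""
    found = dependency_metadata(notebook_json)
    return "requirements" in found and "requirements_lock" in found
```

Given the text of a notebook file, for example `open("experiment.ipynb").read()` with a file name of your own, `is_self_contained` tells you whether the notebook carries both its requirements and its lock.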
With the extension, you can:
- start working on a new notebook
- create dependencies for your existing notebook
- convert a notebook that uses pip commands in cells
- use a reproducible notebook
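To give an idea of what converting a notebook with pip commands in its cells involves, the following sketch scans the code cells of a notebook for `!pip install` or `%pip install` lines and collects the packages they name. It is a simplified illustration and not the extension's implementation: the function name is hypothetical, and arguments that belong to options (for example the file name after `-r`) are not handled specially.

```python
import json
import re

# Matches shell-escape or magic forms such as "!pip install x" and "%pip install x".
PIP_INSTALL = re.compile(r"^\s*[!%]\s*pip3?\s+install\s+(.*)$")


def packages_installed_in_cells(notebook_json):
    """List the packages named in pip install commands inside code cells."""
    notebook = json.loads(notebook_json)
    found = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The notebook format stores source as one string or as a list of lines.
        lines = source.splitlines() if isinstance(source, str) else source
        for line in lines:
            match = PIP_INSTALL.match(line)
            if not match:
                continue
            for token in match.group(1).split():
                # Ignore flags such as -q or --upgrade, and skip duplicates.
                if not token.startswith("-") and token not in found:
                    found.append(token)
    return found
```

The resulting list is the set of direct dependencies that would then need to be declared and locked in the notebook metadata instead of being installed from inside the cells.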
About the Author/ODSC West 2021 Speaker on Jupyter Notebooks:
Francesco Murdaca is a senior data scientist/software engineer at Red Hat and he is part of the AI Centre of Excellence (AICoE) and Office of the CTO. He works with the Thoth team, where they created a recommender system to help developers (including data scientists) focus on important problems offloading many automated tasks performed by pipelines and bots. He has a passion for AI, software, and space. He previously worked on a research project with the European Space Agency (ESA) and industrial partners mixing AI and space to create a recommender system for the design of satellites. He loves to read, travel, and learn languages.