Data Science Notebooks | 2020 Review

2020 was a roller coaster, but the data science community is going strong. Interest in the data science domain has grown in the past year yet again. We dug into the data to learn more about the current state of a vital part of the data science ecosystem: the notebooks.

The analyses in this article were built using a Deepnote notebook. Click the link to run our code and explore the data yourself.

This article on data science notebooks consists of 5 key sections:

1. First, we explore key stats and trends of Jupyter notebooks on GitHub.
2. We double down on popular Python libraries and show you what libraries to add to your toolkit for plotting, ML, NLP, and other use cases.
3. We explore search trends from Google & YouTube.
4. We describe our data sources and how you can build on top of them.
5. We wrap up with a conclusion and ideas for future research.

1. Notebooks on GitHub

First, we analyzed repositories containing Jupyter notebooks on GitHub. Here are some general stats for 2020:

– Number of created repositories containing Jupyter Notebooks: 10,176
– Number of commits: 131,753
– Number of issues: 51,887
– Number of discussions: 101
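
Counts like these can be approximated with the GitHub search API. The sketch below only builds the request URL and sends nothing. It assumes the `language:"Jupyter Notebook"` search qualifier, which matches a repository's primary language and is therefore narrower than "contains a notebook". The function name and its parameters are illustrative, not taken from our notebook.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://api.github.com/search/repositories"


def notebook_repo_query(year: int, page: int = 1, per_page: int = 100) -> str:
    """Build a search URL for repos created in `year` whose main language is Jupyter Notebook."""
    # Assumption: filtering by primary language is a good-enough proxy for "notebook repo".
    query = f'language:"Jupyter Notebook" created:{year}-01-01..{year}-12-31'
    params = {
        "q": query,
        "sort": "stars",
        "order": "desc",
        "per_page": per_page,
        "page": page,
    }
    return f"{SEARCH_URL}?{urlencode(params)}"


print(notebook_repo_query(2020))
```

A GET request to this URL returns JSON whose `total_count` field holds the number of matching repositories. Unauthenticated requests are rate-limited, so a real crawl would likely need a token.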

Most popular notebook repos

Love is love — heart or stars, it’s all the same. The most starred repository of 2020 is Fast AI’s Fastbook with 11k stars, 39 contributors, and 3.4k forks. The repo covers an introduction to deep learning, fast.ai, and PyTorch.

Fast AI wins the second spot too with the FastPages repository — a blogging platform (2k stars, 91 contributors, 409 forks).

The all-time favorite repository, with 27.3k stars, is the Python Data Science Handbook, created in 2017. It contains the entire book in the form of Jupyter notebooks.

Most used Python versions

Python is the most popular language in Jupyter notebooks, and our analysis found a variety of versions in use. Python 3.6 is the most used version, with over 55% of users, followed by Python 3.7 at 36.5%. Python 3.5 and 2.7 each account for only around 0.5% of users.
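
A notebook file is plain JSON, and the nbformat schema records the interpreter under `metadata.language_info.version`. The following is a minimal sketch of how version shares could be tallied from that field. The helper names and the sample data are hypothetical, and it counts notebooks rather than distinct users.

```python
from collections import Counter


def python_minor_version(notebook: dict):
    """Return 'major.minor' from metadata.language_info.version, or None if unavailable."""
    info = notebook.get("metadata", {}).get("language_info", {})
    if info.get("name") != "python":
        return None
    parts = (info.get("version") or "").split(".")
    return ".".join(parts[:2]) if len(parts) >= 2 else None


def version_shares(notebooks) -> dict:
    """Percentage of notebooks per Python minor version, most common first."""
    counts = Counter(v for v in map(python_minor_version, notebooks) if v)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {v: round(100 * n / total, 1) for v, n in counts.most_common()}


# Made-up sample: three notebooks with a recorded version, one without.
sample = [
    {"metadata": {"language_info": {"name": "python", "version": "3.6.9"}}},
    {"metadata": {"language_info": {"name": "python", "version": "3.6.5"}}},
    {"metadata": {"language_info": {"name": "python", "version": "3.7.4"}}},
    {"metadata": {}},  # no version recorded, so it is skipped
]
print(version_shares(sample))
```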


Here’s the open-source license distribution — MIT License takes the lead, followed by Apache 2.0.

Most common kernel names

This one is a bit technical. The most common kernels are variations of python3, but we also see heavy use of conda.
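
The kernel name lives in the notebook JSON under `metadata.kernelspec.name`. As a rough illustration, the sketch below reads that field and buckets names into families. The bucketing rules (any name containing "conda", any name starting with "python") are our simplifying assumption, and the sample names are invented.

```python
from collections import Counter


def kernel_name(notebook: dict) -> str:
    """Read metadata.kernelspec.name, falling back to 'unknown'."""
    return notebook.get("metadata", {}).get("kernelspec", {}).get("name", "unknown")


def kernel_family(name: str) -> str:
    """Group raw kernel names into coarse families (heuristic, not exhaustive)."""
    lowered = name.lower()
    if "conda" in lowered:
        return "conda"
    if lowered.startswith("python"):
        return "python"
    return "other"


# Invented kernel names for illustration only.
sample = [
    {"metadata": {"kernelspec": {"name": name}}}
    for name in ["python3", "conda-env-ml-py", "python3", "ir", "conda_python3"]
]
families = Counter(kernel_family(kernel_name(nb)) for nb in sample)
print(families)
```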

2. Doubling down on popular Python libraries

Next, we looked at library popularity. In our small GitHub 2020 dataset, matplotlib, numpy, and pandas are the most popular libraries.

To provide a more in-depth view, we analyzed the Datalore 10M dataset and found very similar results. No surprise here: numpy, pandas, and matplotlib are the top 3 most-imported libraries.
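
One way to find imported libraries in a notebook is to parse each code cell with Python's `ast` module. This is a sketch under assumptions, not necessarily how either dataset was processed: it drops lines starting with `%` or `!` (IPython magics and shell escapes), skips cells that still fail to parse, and counts each library once per notebook. The function name and sample notebook are hypothetical.

```python
import ast
from collections import Counter


def imported_libraries(notebook: dict) -> set:
    """Collect top-level package names imported in a notebook's code cells."""
    libs = set()
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):  # nbformat allows a string or a list of lines
            source = "".join(source)
        # Drop IPython magics and shell escapes, which are not valid Python.
        lines = [ln for ln in source.splitlines() if not ln.lstrip().startswith(("%", "!"))]
        try:
            tree = ast.parse("\n".join(lines))
        except SyntaxError:
            continue  # skip cells that still do not parse
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                libs.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                libs.add(node.module.split(".")[0])  # ignore relative imports
    return libs


# Made-up notebook: two code cells and one markdown cell that must be ignored.
sample = {
    "cells": [
        {"cell_type": "code",
         "source": "%matplotlib inline\nimport numpy as np\nimport matplotlib.pyplot as plt\n"},
        {"cell_type": "code",
         "source": ["from pandas import DataFrame\n", "from . import local\n"]},
        {"cell_type": "markdown", "source": "import os"},
    ]
}
popularity = Counter(lib for nb in [sample] for lib in imported_libraries(nb))
print(popularity.most_common())
```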

Take a look at the top 20 libraries overall:

In the sections below, we categorize the most used Python libraries across different subject areas. Which of these will you add to your toolkit in 2021?
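
To make the categorization concrete, here is a small sketch that computes each library's share within one subject area. The category map is hypothetical and far from complete, and the share is calculated over all imports that fall into the category, which is only one possible reading of the percentages quoted below.

```python
from collections import Counter

# Hypothetical category map; extend it with the libraries you care about.
CATEGORIES = {
    "plotting": {"matplotlib", "seaborn", "plotly"},
    "ml": {"tensorflow", "keras", "sklearn"},
    "nlp": {"nltk", "spacy", "gensim"},
}


def category_shares(notebook_imports, category: str) -> dict:
    """Share (in %) of each library among all imports that belong to one category."""
    members = CATEGORIES[category]
    counts = Counter(lib for libs in notebook_imports for lib in libs if lib in members)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lib: round(100 * n / total, 1) for lib, n in counts.most_common()}


# Made-up input: one set of imported libraries per notebook.
sample_imports = [
    {"tensorflow", "numpy"},
    {"keras", "tensorflow"},
    {"sklearn", "pandas"},
    {"matplotlib"},
]
print(category_shares(sample_imports, "ml"))
```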

Plotting libraries

Matplotlib is the most popular plotting library, with a clear lead over seaborn and plotly.

ML libraries

In machine learning, TensorFlow has been the most popular library with 40.0% of users importing it for their ML tasks, closely followed by keras at 34.1%.

NLP libraries

For natural language processing, nltk is the clear #1 with 63.0% of imports.

Geospatial analysis libraries

For geospatial analyses, folium has been the most popular library, followed by geopandas and shapely.

Compression libraries

Zipfile takes #1 with 48.4% of users importing the library for their compression tasks.

Other libraries

Here’s a look at a couple of other subject-specific libraries gaining popularity in notebooks:

– Chemistry: pymatgen
– Medical imaging: nibabel
– Astronomy: astropy

It’s important to note that many researchers from similar domains are still not using notebooks, or Python for that matter.

3. Notebooks in search

What does Google search reveal?

We also looked at what people searched for in relation to notebooks in 2020. The top 10 search queries ask about Jupyter notebooks, Python, and .ipynb file manipulation.

Scoring of search terms in this section is relative. A value of 100 is assigned to the most commonly searched query, a value of 50 to a query searched half as often, and so on.
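
The relative scale can be reproduced in a few lines: divide every count by the largest count and multiply by 100. The raw counts below are invented for illustration, since search tools typically expose only the scaled values.

```python
def relative_scores(counts: dict) -> dict:
    """Scale raw counts so that the most frequent query scores 100."""
    top = max(counts.values(), default=0)
    if top == 0:
        return {}
    return {query: round(100 * n / top) for query, n in counts.items()}


# Hypothetical raw search counts.
raw = {"jupyter notebook": 800, "ipynb": 400, "open ipynb file": 200}
print(relative_scores(raw))
```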

What’s trending on YouTube?

Notebook-related queries on YouTube look very similar to those on Google. Viewers have been asking about Jupyter and Python, installation, and Anaconda setup.

4. Data sources

For the analyses in this article, we created a representative dataset of 700 Jupyter notebooks that favors faster processing and maps key notebook trends. We also mined new insights from a dataset gathered by the Datalore team and looked at other datasets for comparison. Here’s a closer look at the data sources and how you can build on top of them:

1. GitHub API, Deepnote 2020 mini dataset: Yashika Sharma curated a smaller dataset of the top ~700 GitHub notebooks from 2020. You can find the metadata and all JSON notebook contents in our Deepnote project. Feel free to duplicate the project and look for more insights in the data.
2. Datalore 10M dataset: In December, Datalore published a blog post called “We Downloaded 10,000,000 Jupyter Notebooks From Github — This Is What We Learned”, with an accompanying notebook. They curated a dataset of 10M notebooks (5TB) and provide a simplified 3GB version with filtered CSVs. The filtered data include notebook names, imports, versions, and text stats. The authors also calculate the consistency of notebooks by re-executing cells and comparing the outputs. You can access all 10M notebooks directly in the github-notebooks-update1 S3 bucket. You can also access the smaller dataset right in our supporting Deepnote project.
3. Google and YouTube search trends
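
The consistency idea mentioned for the Datalore dataset (re-execute cells, compare outputs) can be illustrated with a toy version. This is our own simplified sketch, not Datalore's method: it runs code cells in order with `exec`, compares only captured stdout against stored `stream` outputs, and treats a cell that raises as inconsistent. Only run something like this on notebooks you trust, since it executes their code.

```python
import contextlib
import io


def stored_stdout(cell: dict) -> str:
    """Concatenate the stdout stream outputs saved in a code cell."""
    chunks = []
    for out in cell.get("outputs", []):
        if out.get("output_type") == "stream" and out.get("name") == "stdout":
            text = out.get("text", "")
            chunks.append("".join(text) if isinstance(text, list) else text)
    return "".join(chunks)


def consistency(notebook: dict):
    """Fraction of code cells whose re-executed stdout equals the stored stdout."""
    namespace = {}  # shared across cells, like a kernel session
    matches = total = 0
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):
            source = "".join(source)
        total += 1
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(source, namespace)  # unsafe for untrusted notebooks
        except Exception:
            continue  # a failing cell counts as inconsistent
        matches += buffer.getvalue() == stored_stdout(cell)
    return matches / total if total else None


# Made-up notebook: the second cell's stored output no longer matches.
sample = {"cells": [
    {"cell_type": "code", "source": "x = 2\nprint(x * 21)",
     "outputs": [{"output_type": "stream", "name": "stdout", "text": "42\n"}]},
    {"cell_type": "code", "source": "print(x + 1)",
     "outputs": [{"output_type": "stream", "name": "stdout", "text": ["4\n"]}]},
    {"cell_type": "code", "source": "y = x", "outputs": []},
]}
score = consistency(sample)
print(score)
```

A production check would more likely run notebooks in an isolated kernel and also compare rich outputs such as tables and images, which this sketch ignores.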

5. Conclusion & future research on data science notebooks

We’ve seen that it’s very easy to analyze code from GitHub, and we were able to find a representative sample with only hundreds of notebooks. The Datalore folks have already shown a unique consistency analysis, but we think there is a lot more to find in the notebooks. Feel free to adjust & extend our analysis, and show us when you do at @DeepnoteHQ on Twitter.

Original article here. Reposted with permission.

ODSC Community

The Open Data Science community is passionate and diverse, and we always welcome contributions from data science professionals! All of the articles under this profile are from our community, with individual authors mentioned in the text itself.