2020 was a roller coaster, but the data science community is going strong. Interest in the data science domain has grown in the past year yet again. We dug into the data to learn more about the current state of a vital part of the data science ecosystem: the notebooks.
The analyses in this article were built using a Deepnote notebook. Click the link to run our code and explore the data yourself.
This article on data science notebooks consists of 5 key sections:
1. First, we explore key stats and trends of Jupyter notebooks on GitHub.
2. We double down on popular Python libraries and show you what libraries to add to your toolkit for plotting, ML, NLP, and other use cases.
3. We explore search trends from Google & YouTube.
4. Data sources and how you can build on top of them.
5. Conclusion & ideas for future research.
1. Notebooks on GitHub
First, we analyzed repositories containing Jupyter notebooks on GitHub. Here are some general stats for 2020:
– Number of created repositories containing Jupyter Notebooks: 10,176
– Number of commits: 131,753
– Number of issues: 51,887
– Number of discussions: 101
Most popular notebook repos
Love is love — heart or stars, it’s all the same. The most starred repository of 2020 is Fast AI’s Fastbook with 11k stars, 39 contributors, and 3.4k forks. The repo covers an introduction to deep learning, fast.ai, and PyTorch.
Fast AI wins the second spot too with the FastPages repository — a blogging platform (2k stars, 91 contributors, 409 forks).
The all-time favorite repository with 27.3k stars is Python Data Science Handbook created in 2017. This one contains the entire Python Data Science Handbook, in the form of Jupyter notebooks.
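A ranking like this one could be requested from GitHub’s repository search endpoint. The sketch below only builds the request URL and sends nothing. The `language:"Jupyter Notebook"` qualifier is an assumption about how to find notebook repositories (it matches repositories whose primary language is Jupyter Notebook, which is narrower than every repository containing a notebook), and `build_search_url` is a hypothetical helper, not the code behind this article.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.github.com/search/repositories"


def build_search_url(year: int, per_page: int = 10) -> str:
    """Build a search URL for the most starred notebook repos created in `year`."""
    # Assumption: primary language "Jupyter Notebook" stands in for "notebook repo".
    query = f'language:"Jupyter Notebook" created:{year}-01-01..{year}-12-31'
    params = {"q": query, "sort": "stars", "order": "desc", "per_page": per_page}
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


url = build_search_url(2020)
print(url)
```

Fetching the URL returns JSON with an `items` list; each item carries fields such as `full_name`, `stargazers_count`, and `forks_count`. Unauthenticated requests are rate limited, so a real crawl would need a token and paging.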
Most used Python versions
Python is the most popular language in Jupyter notebooks, and our analysis turned up a range of versions in use. Python 3.6 is the most used version at over 55% of users, followed by Python 3.7 at 36.5%. Python 3.5 and 2.7 account for only around 0.5% of users each.
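Jupyter stores the interpreter version in each notebook’s JSON under `metadata.language_info.version`, so a version breakdown can be computed without running anything. Here is a minimal sketch, assuming the notebooks are already loaded as dictionaries; the three sample notebooks are made up for illustration and are not taken from our dataset.

```python
from collections import Counter
from typing import Optional


def python_minor_version(notebook: dict) -> Optional[str]:
    """Return e.g. '3.6' from a notebook's metadata, or None if it is missing."""
    version = notebook.get("metadata", {}).get("language_info", {}).get("version")
    if not version:
        return None
    return ".".join(version.split(".")[:2])


def version_shares(notebooks: list) -> dict:
    """Percentage of notebooks per minor version, ignoring notebooks without one."""
    counts = Counter(v for v in map(python_minor_version, notebooks) if v)
    total = sum(counts.values())
    return {v: round(100 * n / total, 1) for v, n in counts.most_common()}


# Toy input, for illustration only.
sample_notebooks = [
    {"metadata": {"language_info": {"name": "python", "version": "3.6.9"}}},
    {"metadata": {"language_info": {"name": "python", "version": "3.6.5"}}},
    {"metadata": {"language_info": {"name": "python", "version": "3.7.4"}}},
    {"metadata": {}},  # no version recorded, so it is skipped
]

shares = version_shares(sample_notebooks)
print(shares)
```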
Here’s the open-source license distribution — MIT License takes the lead, followed by Apache 2.0.
Most common kernel names
This one is a bit technical. The most common kernels are variations of python3, but we also see a heavy use of conda.
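The kernel name lives next to the version info, under `metadata.kernelspec.name`. The sketch below tallies it from raw notebook JSON. The four JSON strings are toy examples, and the `conda-env-ml-py` name is only an illustration of what a conda-managed kernel can look like.

```python
import json
from collections import Counter

# Toy notebook files, for illustration only.
RAW_NOTEBOOKS = [
    '{"metadata": {"kernelspec": {"name": "python3", "display_name": "Python 3"}}, "cells": []}',
    '{"metadata": {"kernelspec": {"name": "python3", "display_name": "Python 3"}}, "cells": []}',
    '{"metadata": {"kernelspec": {"name": "conda-env-ml-py", "display_name": "Python [conda env:ml]"}}, "cells": []}',
    '{"metadata": {}, "cells": []}',
]


def kernel_name(notebook: dict) -> str:
    """Return the kernelspec name, or 'unknown' when the metadata is missing."""
    return notebook.get("metadata", {}).get("kernelspec", {}).get("name", "unknown")


kernel_counts = Counter(kernel_name(json.loads(raw)) for raw in RAW_NOTEBOOKS)
print(kernel_counts.most_common())
```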
2. Doubling down on popular Python libraries
We also looked at library popularity. In our small GitHub 2020 dataset, matplotlib, numpy, and pandas were the most popular.
To provide a more in-depth view, we analyzed the Datalore 10M dataset and found very similar results. No surprise here: numpy, pandas, and matplotlib are the top 3 most-imported libraries.
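One way to count imports is to parse every code cell with Python’s `ast` module and collect the top-level package names. This is a sketch under our own assumptions, not necessarily how either dataset was processed: each library is counted once per notebook, IPython magics and shell escapes are dropped before parsing, and cells that still fail to parse are skipped. The two sample notebooks are invented.

```python
import ast
from collections import Counter


def cell_source(cell: dict) -> str:
    """Notebook JSON stores source as either a string or a list of lines."""
    src = cell.get("source", "")
    return "".join(src) if isinstance(src, list) else src


def imported_modules(code: str) -> set:
    """Top-level packages imported by a piece of code (empty set if unparsable)."""
    # Drop IPython magics (%...) and shell escapes (!...), which are not valid Python.
    lines = [ln for ln in code.splitlines() if not ln.lstrip().startswith(("%", "!"))]
    try:
        tree = ast.parse("\n".join(lines))
    except SyntaxError:
        return set()
    modules = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.add(node.module.split(".")[0])
    return modules


def count_imports(notebooks: list) -> Counter:
    """Number of notebooks importing each library."""
    counts = Counter()
    for nb in notebooks:
        seen = set()
        for cell in nb.get("cells", []):
            if cell.get("cell_type") == "code":
                seen |= imported_modules(cell_source(cell))
        counts.update(seen)
    return counts


# Toy notebooks, for illustration only.
toy_notebooks = [
    {"cells": [
        {"cell_type": "code", "source": [
            "import numpy as np\n", "import pandas as pd\n",
            "%matplotlib inline\n", "import matplotlib.pyplot as plt\n"]},
        {"cell_type": "markdown", "source": "import os"},
    ]},
    {"cells": [
        {"cell_type": "code",
         "source": "from sklearn.linear_model import LinearRegression\nimport numpy as np"},
    ]},
]

import_counts = count_imports(toy_notebooks)
print(import_counts.most_common())
```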
Take a look at the top 20 libraries overall:
In the sections below, we categorize the most used Python libraries across different subject areas. Which of these will you add to your toolkit in 2021?
Matplotlib is the most popular plotting library, with a clear lead over seaborn and plotly.
In machine learning, TensorFlow has been the most popular library with 40.0% of users importing it for their ML tasks, closely followed by keras at 34.1%.
For natural language processing, nltk is the clear #1 with 63.0% of imports.
Geospatial analysis libraries
For geospatial analyses, folium has been the most popular library, followed by geopandas and shapely.
Zipfile takes #1 with 48.4% of users importing the library for their compression tasks.
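Percentages like the ones above are shares within a subject area. One plausible reading, and it is only an assumption, is each library’s import count divided by the total for its category. The sketch below uses the plotting and geospatial libraries named in this section; the counts are toy numbers chosen to make the arithmetic easy and are not results from either dataset.

```python
# Category membership taken from the libraries named in this section.
CATEGORIES = {
    "plotting": ["matplotlib", "seaborn", "plotly"],
    "geospatial": ["folium", "geopandas", "shapely"],
}

# Toy import counts, for illustration only.
toy_counts = {
    "matplotlib": 6, "seaborn": 3, "plotly": 1,
    "folium": 5, "geopandas": 3, "shapely": 2,
}


def category_shares(counts: dict, libraries: list) -> dict:
    """Share (in %) of each library within its category, largest first."""
    total = sum(counts.get(lib, 0) for lib in libraries)
    if total == 0:
        return {}
    shares = {lib: round(100 * counts.get(lib, 0) / total, 1) for lib in libraries}
    return dict(sorted(shares.items(), key=lambda kv: kv[1], reverse=True))


for category, libraries in CATEGORIES.items():
    print(category, category_shares(toy_counts, libraries))
```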
Here’s a look at a couple of other subject-specific libraries gaining popularity in notebooks:
– Chemistry: pymatgen
– Medical imaging: nibabel
– Astronomy: astropy
It’s important to note that many researchers from similar domains are still not using notebooks, or Python for that matter.
3. Notebooks in search
What does Google search reveal?
We’ve also had a look at what people have been searching for in relation to notebooks in 2020. The top 10 search queries ask about Jupyter notebooks, Python, and .ipynb file manipulation.
Scoring of search terms in this section is relative. A value of 100 is assigned to the most commonly searched query, a value of 50 to a query searched half as often, and so on.
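The relative scale is a simple normalization: divide every query’s volume by the largest volume and multiply by 100. Here is a short sketch; the raw volumes are invented, since search tools expose only the relative scores.

```python
def relative_scores(volumes: dict) -> dict:
    """Rescale raw volumes so the most searched query scores 100."""
    peak = max(volumes.values(), default=0)
    if peak == 0:
        return {query: 0 for query in volumes}
    return {query: round(100 * volume / peak) for query, volume in volumes.items()}


# Toy volumes, for illustration only.
toy_volumes = {"jupyter notebook": 800, "python notebook": 400, "open ipynb file": 200}

scores = relative_scores(toy_volumes)
print(scores)
```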
What’s trending on YouTube?
Notebook-related queries on YouTube look very similar to those on Google: viewers have been asking about Jupyter and Python, installation, and Anaconda setup.
4. Data sources
For the analyses in this article, we created a representative dataset of 700 Jupyter notebooks that favors faster processing and maps key notebook trends. We also mined new insights from a dataset gathered by the Datalore team and looked at other datasets for comparison. Here’s a closer look at the data sources and how you can build on top of them:
1. GitHub API, Deepnote 2020 mini dataset: Yashika Sharma curated a smaller dataset of the top ~700 GitHub notebooks from 2020. You can find the metadata and all JSON notebook contents in our Deepnote project. Feel free to duplicate the project and look for more insights in the data.
2. Datalore 10M dataset: In December, Datalore published a blog post called We Downloaded 10,000,000 Jupyter Notebooks From Github — This Is What We Learned, with an accompanying notebook. They curated a dataset of 10M notebooks (5TB) and provide a simplified 3GB version with filtered CSVs. The filtered data include notebook names, imports, versions, and text stats. The authors also measure the consistency of notebooks by re-executing cells and comparing the outputs. You can access all 10M notebooks directly in the github-notebooks-update1 S3 bucket. You can also access the smaller dataset right in our supporting Deepnote project.
3. Google and YouTube search trends
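To give a feel for the consistency idea mentioned for the Datalore dataset, here is a rough sketch of the comparison step only. It is not Datalore’s actual procedure: it assumes you already have two copies of a notebook as dictionaries (the stored one and one you re-executed yourself), compares only text outputs, and ignores execution counts. All function names and the sample notebooks are hypothetical.

```python
def code_cells(notebook: dict) -> list:
    return [c for c in notebook.get("cells", []) if c.get("cell_type") == "code"]


def as_text(value) -> str:
    """Notebook JSON stores text as either a string or a list of lines."""
    return "".join(value) if isinstance(value, list) else (value or "")


def normalize(output: dict) -> tuple:
    """Keep only the parts of an output worth comparing (text, no counters)."""
    kind = output.get("output_type")
    if kind == "stream":
        return (kind, output.get("name"), as_text(output.get("text")))
    if kind in ("execute_result", "display_data"):
        # Assumption: compare the plain-text representation only; images are ignored.
        return (kind, as_text(output.get("data", {}).get("text/plain")))
    if kind == "error":
        return (kind, output.get("ename"), output.get("evalue"))
    return (kind,)


def consistency(stored: dict, rerun: dict):
    """Fraction of code cells whose outputs match after re-execution."""
    pairs = list(zip(code_cells(stored), code_cells(rerun)))
    if not pairs:
        return None
    same = sum(
        [normalize(o) for o in a.get("outputs", [])]
        == [normalize(o) for o in b.get("outputs", [])]
        for a, b in pairs
    )
    return same / len(pairs)


# Toy notebooks, for illustration only.
stored_nb = {"cells": [
    {"cell_type": "code", "source": "print(1 + 1)",
     "outputs": [{"output_type": "stream", "name": "stdout", "text": ["2\n"]}]},
    {"cell_type": "markdown", "source": "notes"},
    {"cell_type": "code", "source": "random.random()",
     "outputs": [{"output_type": "execute_result", "execution_count": 2,
                  "data": {"text/plain": "0.42"}, "metadata": {}}]},
]}
rerun_nb = {"cells": [
    {"cell_type": "code", "source": "print(1 + 1)",
     "outputs": [{"output_type": "stream", "name": "stdout", "text": "2\n"}]},
    {"cell_type": "markdown", "source": "notes"},
    {"cell_type": "code", "source": "random.random()",
     "outputs": [{"output_type": "execute_result", "execution_count": 1,
                  "data": {"text/plain": "0.87"}, "metadata": {}}]},
]}

print(consistency(stored_nb, rerun_nb))
```

In the toy pair, the `print` cell reproduces its output while the random-number cell does not, which is the kind of difference a consistency check is meant to surface.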
5. Conclusion & future research on data science notebooks
We’ve seen that it’s very easy to analyze code from GitHub, and we’ve been able to find a representative sample with only hundreds of notebooks. The Datalore folks have already shown a unique consistency analysis, but we think there is a lot more to find in the notebooks. Feel free to adjust & extend our analysis, and show us when you do at @DeepnoteHQ on Twitter.
Original article here. Reposted with permission.