The Ten Most Important Data Science GitHub Repositories for 2022
Posted by ODSC Team, December 14, 2022
Every year, the data science landscape changes. With new tools, frameworks, packages, languages, and verticals changing the game, it’s always interesting to see what people are using. We looked at all data science GitHub repositories and Python download stats and picked out ten of the most important and popular ones that made an impact this year.
Superset is a must-try project for any ML engineer, data scientist, or data analyst. Features include an intuitive interface for visualizing datasets and building interactive dashboards. It offers impressive performance, a broad integration library, and solid security and authentication. The no-code visualization builder is a handy feature. Apache Superset remains popular thanks to how much control it gives you over your data.
pandas is a popular data analysis library built on top of the Python programming language, and getting started with it is easy. It handles common data manipulations such as cleaning, joining, sorting, filtering, and deduping. First released in 2009, pandas now sits at the center of Python’s vast data science ecosystem and is an essential tool in the modern data analyst’s toolbox. Its ease of use will keep it relevant for years to come.
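To make those manipulations concrete, here is a minimal sketch that chains deduping, cleaning, joining, filtering, and sorting. The `orders` and `customers` tables, their column names, and the `summarize` helper are invented for illustration and do not come from any real dataset.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "customer_id": [10, 11, 11, 12, 10],
    "amount": [25.0, 40.0, 40.0, None, 15.0],
})
customers = pd.DataFrame({
    "customer_id": [10, 11, 12],
    "name": ["Ada", "Grace", "Linus"],
})


def summarize(orders, customers, min_amount=20):
    """Dedupe, clean, join, filter, and sort the toy order data."""
    cleaned = (
        orders
        .drop_duplicates()          # deduping: drop fully repeated rows
        .dropna(subset=["amount"])  # cleaning: drop rows with no amount
    )
    # Joining: attach the customer name to each order.
    joined = cleaned.merge(customers, on="customer_id", how="left")
    # Filtering: keep only orders at or above the threshold.
    filtered = joined[joined["amount"] >= min_amount]
    # Sorting: largest orders first, with a fresh 0..n-1 index.
    return filtered.sort_values("amount", ascending=False).reset_index(drop=True)


result = summarize(orders, customers)
print(result)
```

Each step returns a new DataFrame, which is why the operations chain so easily.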
Prefect 2 is the second-generation dataflow coordination and orchestration platform from Prefect. Prefect 2 has been designed from the ground up to handle the dynamic, scalable workloads that the modern data stack demands. Powered by Prefect Orion, a brand-new, asynchronous rules engine, it represents an enormous amount of research, development, and dedication. Prefect is a popular tool amongst the data engineering community and MLOps practitioners thanks to its open-source, flexible nature.
As the most downloaded low-code Python framework for building machine learning and data science web apps, Dash saw increased use throughout the year. Built on top of Plotly.js, React, and Flask, Dash ties modern UI elements like dropdowns, sliders, and graphs directly to your analytical Python code.
PyTorch has quickly been gaining steam as a leading deep learning framework, with many practitioners lately choosing it over TensorFlow as their go-to framework. PyTorch is an open-source framework built for developing machine learning and deep learning models. In particular, it provides the stability and support required for building computational models in the development phase and deploying them in the production phase.
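For a sense of what developing a model in PyTorch looks like, here is a minimal training-loop sketch that fits a one-parameter linear model to synthetic data. The data, learning rate, and number of steps are assumptions chosen only to keep the example tiny.

```python
import torch
from torch import nn

torch.manual_seed(0)  # make the synthetic noise and initial weights repeatable

# Synthetic data (an assumption for this sketch): y is roughly 3x + 0.5.
X = torch.linspace(-1, 1, 64).unsqueeze(1)
y = 3 * X + 0.5 + 0.05 * torch.randn_like(X)

model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for _ in range(200):
    optimizer.zero_grad()        # clear gradients from the previous step
    loss = loss_fn(model(X), y)  # forward pass and loss
    loss.backward()              # backpropagate
    optimizer.step()             # update the weight and bias
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
print(f"learned weight: {model.weight.item():.2f}, bias: {model.bias.item():.2f}")
```

The same four-step loop (zero gradients, forward, backward, step) carries over to much larger models.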
Developed by Google, TensorFlow is still the leading machine learning framework amongst machine learning practitioners. The names TensorFlow and Keras are sometimes used interchangeably, given how many people use Keras for deep learning. Keras is the high-level API of TensorFlow 2: an approachable, highly productive interface for solving machine learning problems, with a focus on modern deep learning.
Hugging Face Transformers
Hugging Face Transformers provides APIs and tools to easily download and train state-of-the-art pretrained models. Using pretrained models can reduce your compute costs and carbon footprint, and save you the time and resources required to train a model from scratch. Hugging Face has made a name for itself especially within the NLP realm for text classification, language modeling, and more.
The Gradio Python library was developed to create web demos from machine learning models. With Gradio, you can quickly create a beautiful user interface around your machine learning models or data science workflow and let people “try it out” by dragging and dropping in their own images, pasting text, recording their own voice, and interacting with your demo, all through the browser. It stands out for how easy and fast it is to use, even for those who are not data science-savvy.
scikit-learn is a machine learning library for Python that will help you access just about any ML algorithm you may need for the project you’re working on. It’s worth noting that scikit-learn was specifically designed for use in tandem with SciPy and NumPy. Typical applications include model selection, regression, dimensionality reduction, clustering, and classification. It’s been used for years thanks to its robustness, its active community, and its broad range of uses.
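Several of those applications can be combined in a few lines. The sketch below uses the bundled iris dataset and chains scaling, dimensionality reduction, and classification in one pipeline, then evaluates it with cross-validation. The choice of PCA with two components and logistic regression is an assumption made for illustration, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a small built-in dataset and hold out a quarter of it for testing.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scaling -> dimensionality reduction -> classification, as one estimator.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    LogisticRegression(max_iter=1000),
)

# Model selection: estimate generalization with 5-fold cross-validation.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Fit on the full training split and score on the held-out test split.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print(f"mean CV accuracy: {cv_scores.mean():.3f}")
print(f"test accuracy: {test_accuracy:.3f}")
```

Because every step follows the same fit and predict interface, any stage of the pipeline can be swapped for another estimator without changing the rest.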
Proposed in a 2018 paper that has since been credited with over 46,500 citations, BERT is probably already familiar to you for its transformational role in NLP. BERT’s architecture allows it to understand bidirectional context, which delivers state-of-the-art results on NER, language understanding, question answering, and several other general NLP tasks. Pre-trained on a massive corpus (by 2018 standards), it is still very popular in today’s LLM (large language model) space. It’s not the most active project: the most recent update, in March 2020, added two dozen smaller BERT models to the set.
Additional Popular GitHub Repositories
The above repositories are just ten that we felt were worth mentioning and that proved popular among GitHub users. Below, you’ll find more popular repositories sorted by category.
How to learn more about these GitHub repositories and other tools
Everything above has made an impact in the field of data science in 2022. Whether you use these GitHub repositories in research or to discover new applications across verticals, it’s worth checking out the ones relevant to you, if not all of the above. If you want to learn more about the research behind these repositories and how to use them in a real-world setting, then be sure to sign up for an Ai+ Training subscription or attend ODSC East 2023 this May 9th-11th while tickets are still 70% off.