Top 12 Open Source Machine Learning Projects of 2022 (so far)

2022 is rapidly progressing so it’s a good time to take stock of what’s trending in open source machine learning and data science projects. These projects showcase the growth in the field of AI and highlight the current industry trajectory. Using GitHub stars, we tracked the top projects of 2022 so far.

#1: DALL·E Mini

Generate images from a text prompt | Star gain: 2,521 | https://github.com/borisdayma/dalle-mini

You’ve certainly heard of OpenAI’s DALL-E by now. The name is an apt blend of the Pixar character WALL-E and the surrealist artist Salvador Dalí. The program takes a text phrase, such as “Last selfie ever taken” or “Avocado on a chair” or basically anything else you can imagine, and creates an image from it. On GitHub, DALL-E mini is an online text-to-image generator that has gained quite a following, helped by an impressive project architecture.

Image credit: https://wandb.ai/dalle-mini/dalle-mini/reports/DALL-E-Mini-Explained-with-Demo–Vmlldzo4NjIxODA

#2: Hugging Face 🤗 Transformers

State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX | Star gain: 2,154 | https://github.com/huggingface/transformers

Thanks to its huge selection of pre-trained models, Transformers (Hugging Face) was one of the top projects for 2021, and 2022 is proving no exception. The library has expanded beyond NLP transformers and now includes ML models for PyTorch, JAX, and TensorFlow. You can take advantage of their model hub to find models for NLP, computer vision, audio, and many other tasks.

#3: Dolt

It’s Git for Data | Star gain: 1,909 | https://github.com/dolthub/dolt

Around since 2015, Dolt is getting a lot of recognition as a data versioning tool. It’s basically a SQL relational database with Git semantics. This is ideal for diffs on data, table versioning, and conflict detection. As the description says, it’s like Git and MySQL had a baby.
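To make "diffs on data" and "conflict detection" concrete, here is a minimal sketch in plain Python. It is not Dolt's API or implementation; the names `diff_tables` and `find_conflicts` are hypothetical, and tables are assumed to be simple dictionaries keyed by primary key.

```python
from typing import Any, Dict, Optional, Set

Row = Dict[str, Any]
Table = Dict[int, Row]  # assumption: primary key -> row


def diff_tables(old: Table, new: Table) -> Dict[str, Dict[int, Any]]:
    """Row-level diff between two versions of the same table."""
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    modified = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return {"added": added, "removed": removed, "modified": modified}


def find_conflicts(base: Table, ours: Table, theirs: Table) -> Set[int]:
    """Keys that both branches changed, in different ways, relative to base."""
    conflicts: Set[int] = set()
    for key in set(base) | set(ours) | set(theirs):
        b: Optional[Row] = base.get(key)
        o: Optional[Row] = ours.get(key)
        t: Optional[Row] = theirs.get(key)
        if o != b and t != b and o != t:
            conflicts.add(key)
    return conflicts


# Hypothetical example data: one base version and two diverging branches.
base = {1: {"name": "Ada", "team": "ML"}, 2: {"name": "Bo", "team": "Data"}}
ours = {1: {"name": "Ada", "team": "Research"}, 3: {"name": "Cy", "team": "ML"}}
theirs = {1: {"name": "Ada", "team": "Platform"}, 2: {"name": "Bo", "team": "Data"}}

changes = diff_tables(base, ours)
conflicting_keys = find_conflicts(base, ours, theirs)
```

Here row 1 is flagged as a conflict because both branches edited it differently, while the deletion of row 2 on one branch alone is not.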

#4: Ivy

The Unified Machine Learning Framework | Star gain: 1,868 | https://github.com/unifyai/ivy

Running legacy TensorFlow or PyTorch and want to try out JAX, or vice versa? Ivy is a very new machine learning framework that has gained some serious attention over the last six months thanks to its promise of enabling framework-agnostic functions, layers, and libraries that wrap JAX, TensorFlow, PyTorch, MXNet, and NumPy. Next up on the roadmap is a transpiler for automatic code conversion between all frameworks, which will no doubt be quite impactful for many ML teams.
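The general idea behind framework-agnostic functions can be sketched in a few lines of plain Python. This is only an illustration of the dispatch pattern, not Ivy's API: the `BACKENDS` registry and both backend names are hypothetical stand-ins for real frameworks.

```python
import math
from typing import Callable, Dict, List

# Hypothetical backends: each exposes the same small set of primitives,
# the way a wrapper layer might expose one interface over several frameworks.
BACKENDS: Dict[str, Dict[str, Callable]] = {
    "pure_python": {
        "exp": lambda xs: [math.exp(x) for x in xs],
        "sum": lambda xs: sum(xs),
    },
    "fsum_python": {
        "exp": lambda xs: [math.exp(x) for x in xs],
        "sum": lambda xs: math.fsum(xs),  # higher-precision summation
    },
}


def softmax(xs: List[float], backend: str = "pure_python") -> List[float]:
    """Softmax written once against the shared primitives."""
    ops = BACKENDS[backend]
    shift = max(xs)  # subtract the max for numerical stability
    exps = ops["exp"]([x - shift for x in xs])
    total = ops["sum"](exps)
    return [e / total for e in exps]


probs_a = softmax([1.0, 2.0, 3.0], backend="pure_python")
probs_b = softmax([1.0, 2.0, 3.0], backend="fsum_python")
```

The function body never mentions a specific backend, so swapping the implementation only requires changing the `backend` argument.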

#5: MindsDB

In-Database Machine Learning | Star gain: 1,185 | https://github.com/mindsdb/mindsdb

In-database machine learning makes a lot of sense for many use cases, and MindsDB is a popular open-source solution. Thanks to its flexible architecture, it supports most of the common relational databases, including MS SQL Server, ClickHouse, MySQL, and PostgreSQL. It also allows you to save models as tables, and even has AutoML capabilities.
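As a rough illustration of what "models as tables" can mean, the sketch below fits a one-variable linear regression inside SQLite and stores the fitted parameters in a table. It uses only Python's built-in `sqlite3` module and is not MindsDB's syntax or API; the table names, column names, and data are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical training data living in the database.
conn.execute("CREATE TABLE sales (ad_spend REAL, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)],
)

# Fit revenue = slope * ad_spend + intercept with least squares, computed in SQL.
slope, mean_x, mean_y = conn.execute(
    """
    SELECT (COUNT(*) * SUM(ad_spend * revenue) - SUM(ad_spend) * SUM(revenue))
           / (COUNT(*) * SUM(ad_spend * ad_spend) - SUM(ad_spend) * SUM(ad_spend)),
           AVG(ad_spend),
           AVG(revenue)
    FROM sales
    """
).fetchone()
intercept = mean_y - slope * mean_x

# Persist the fitted model as a row in a table.
conn.execute("CREATE TABLE models (name TEXT PRIMARY KEY, slope REAL, intercept REAL)")
conn.execute("INSERT INTO models VALUES (?, ?, ?)", ("revenue_model", slope, intercept))


def predict(connection: sqlite3.Connection, model_name: str, ad_spend: float) -> float:
    """Run the prediction as a query against the stored model."""
    row = connection.execute(
        "SELECT slope * ? + intercept FROM models WHERE name = ?",
        (ad_spend, model_name),
    ).fetchone()
    return row[0]


forecast = predict(conn, "revenue_model", 5.0)
```

Because the model lives next to the data, a prediction is just another query and no rows have to leave the database.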

#6: DeepFaceLive

Real-time face swap for PC streaming or video calls | Star gain: 920 | https://github.com/iperov/DeepFaceLive

Yes, we all know you can be a cat on your next Zoom meeting thanks to open-source tools like DeepFaceLive, and there are many concerns around deepfakes. However, none of this has prevented these entertaining tools from becoming popular for fun, research, and definitely some mischief.


#7: PyTorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration | Star gain: 829 | https://github.com/pytorch/pytorch

We’re back in familiar territory with PyTorch, one of the top open-source machine learning frameworks of the past few years. If anything, it seems to be increasing in popularity versus other ML frameworks. This easy-to-use Python library is ideal for optimized deep learning model building on either GPUs or CPUs.

#8: Instant-NGP

Instant neural graphics primitives: lightning fast NeRF and more | Star gain: 724 | https://github.com/NVlabs/instant-ngp

Incubated at NVIDIA Labs, this is an impressive deep learning project. Creating a fully-connected neural network that can generate unique views of complex 3D scenes, based on a partial set of images, can be slow and costly. This project promises near-instant training of neural graphics primitives on a single GPU and can even handle sparse image sets.

#9: Apache Superset

Data Visualization and Data Exploration Platform | Star gain: 626 | https://github.com/apache/superset 

Superset is a must-try project for any ML engineer, data scientist, or data analyst. Features include an intuitive interface for visualizing datasets and building interactive dashboards. It offers impressive performance, a broad integration library, and solid security and authentication. The no-code visualization builder is a handy feature.

#10: ColossalAI

A Unified Deep Learning System for Big Model Era | Star gain: 566 | https://github.com/hpcaitech/ColossalAI  

Thanks to their impressive performance, large pre-trained models are one of the top trends of the last few years. The promise of prebuilt models that require just a bit of fine-tuning runs up against the realization that even tuning is prohibitively expensive. Less than a year old, ColossalAI is gaining fans for simplifying some of these tasks, including distributed training, parallelism, memory management, and inference.

#11: Gradio

Create UIs for your machine learning model in Python | Star gain: 551 | https://github.com/gradio-app/gradio

Gradio claims to be the fastest way to demonstrate your machine learning models, and open-source users seem to agree. Building a full-stack application around your outputs can be daunting, and ML models have different requirements than traditional software, so open-source frameworks like Gradio are a welcome addition. A few lines of code will create a UI that can be embedded in a notebook or presented as a shareable webpage.
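For a sense of scale, a minimal sketch might look like the following. The `classify_sentiment` function is a toy stand-in for a real model and the word lists are invented for the example; the only Gradio call assumed here is `gr.Interface` with the `"text"` input and output shortcuts. Gradio is imported lazily so the prediction function can be exercised on its own.

```python
def classify_sentiment(text: str) -> str:
    """Toy stand-in for a real model: counts words from two tiny word lists."""
    positive = {"good", "great", "love", "excellent"}
    negative = {"bad", "awful", "hate", "terrible"}
    words = text.lower().split()
    score = sum(w in positive for w in words) - sum(w in negative for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def build_demo():
    """Wrap the function in a simple text-in, text-out web UI."""
    import gradio as gr  # requires `pip install gradio`

    return gr.Interface(fn=classify_sentiment, inputs="text", outputs="text")


# To serve the UI locally:
#     build_demo().launch()
# Passing share=True to launch() requests a temporary public link.
```

The same pattern applies to a real model: replace the body of `classify_sentiment` with a call to your model's prediction method.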

#12: Composer

Train neural networks up to 7x faster | Star gain: 546 | https://github.com/mosaicml/composer

Released in October of last year, Composer helped speed up neural network training with higher accuracy and thus lower cost. This PyTorch library has two dozen efficiency methods for both computer vision and language models. By keeping current with the latest research papers, they promise to update their library with the latest state-of-the-art in efficient neural network training.

Learn more about open source machine learning projects at ODSC West 2022

The above open source machine learning projects represent not only what’s already trending and in demand in the field of data science, but they also showcase what’s going to be a big deal in the months or years to come. As such, it’s important for any practicing or aspiring data scientist to stay up-to-date on everything trending in machine learning. At ODSC West 2022, coming this October 31st to November 3rd, you can learn about a number of these projects, get hands-on training in machine learning, and see what else there is to learn in the field of AI. Here are a few highlighted sessions as part of the machine learning track at ODSC West:

  • Reasoning About the Probabilistic Behavior of Classifiers 
  • Machine Learning with Python: A Hands-On Introduction
  • Beyond the Basics: Data Visualization in Python
  • Responsible AI Is Not an Option
  • Scalable, Real-Time Heart Rate Variability Biofeedback for Precision Health:  A Novel Algorithmic Approach
  • AI in a Minefield: Learning from Poisoned Data
  • StructureBoost: Gradient Boosting with Categorical Structure
  • Causal/Prescriptive Analytics in Business Decisions
  • Any Way You Want It: Integrating Complex Business Requirements into ML Forecasting Systems
  • Separating the Signal from the Noise: Signal Processing and Feature Extraction Techniques for Biological Data
  • Book Signing: Hands-On Data Analysis with Pandas – Second Edition: A Python Data Science Handbook for Data Collection, Wrangling, Analysis, and Visualization
  • Applications of NLP in Retail/E-commerce
  • Running Any ML Code in Any ML Framework
  • Introduction to Machine Learning 
  • Introduction to Python for Data Analysis

Tickets are currently available for both the in-person and virtual conference options. Register by this Friday, August 12th, for 60% off any ticket type. Act fast before the discount disappears!

Sheamus McGovern

Founder of ODSC and Software Architect specializing in complex multi-platform systems across multiple industries, including finance, healthcare, and education.