Natural Language Processing remains one of the hottest topics of 2022. Using GitHub stars as a proxy for popularity (albeit certainly not the only measure), we took a look at which NLP projects are getting the most traction so far this year, just as we recently did with machine learning projects. It’s a list with some familiar names, but there are plenty of surprises too!
State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX | Star Gain: 2,521 | https://github.com/huggingface/transformers
Practitioners love the Transformers project, and with the largest star gain on this list since January, it’s easily our number one pick. The library provides easy-to-use, state-of-the-art pre-trained models with support for PyTorch, TensorFlow, and JAX, and it has expanded beyond NLP to cover computer vision and audio tasks. Its unified API for pre-trained models lowers the barrier to entry for AI practitioners working on natural language understanding and generation as well as vision and audio.
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities | Star Gain: 858 | https://github.com/microsoft/unilm
UniLM was proposed in a paper by Li Dong et al. and presented at some of the top academic conferences including NeurIPS (’19), ICML (’20), and ACL (’21). UniLM (Unified Language Model) is a pre-trained model that can be fine-tuned for both natural language understanding and generation tasks. The models were pre-trained using three types of language modeling tasks: unidirectional, bidirectional, and sequence-to-sequence prediction. It remains a popular and active project, with new pre-trained models added very recently including BEiT-3, SimLM, DiT, LayoutLMv3, and MetaLM, to name a few.
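The three pre-training tasks differ mainly in which tokens each position is allowed to attend to. Here is a minimal sketch of that idea, based on the self-attention masks described in the UniLM paper rather than on code from the repository. The function names and the toy sequence lengths are made up for illustration.

```python
def unidirectional_mask(n):
    """Left-to-right LM: position i may attend to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Bidirectional LM: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def seq2seq_mask(src_len, tgt_len):
    """Sequence-to-sequence LM.

    Source tokens see the whole source segment. Target tokens see the
    whole source plus the target tokens up to and including themselves.
    """
    n = src_len + tgt_len
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j < src_len:
                mask[i][j] = 1  # every position can attend to the source
            elif i >= src_len and j <= i:
                mask[i][j] = 1  # target positions look left within the target
    return mask


# Toy example: 2 source tokens followed by 3 target tokens.
for row in seq2seq_mask(src_len=2, tgt_len=3):
    print(row)
```

A 1 means "row position may attend to column position". In a real model these masks would be applied to the attention scores before the softmax.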
TensorFlow code and pre-trained models for BERT | Star Gain: 832 | https://github.com/google-research/bert
Proposed in a 2018 paper that has since been cited over 46,500 times, BERT is probably already familiar to you, as is its transformational role in NLP. BERT’s architecture allows it to understand bidirectional context, which delivers state-of-the-art results on NER, language understanding, question answering, and several other general NLP tasks. Pre-trained on a massive corpus (by 2018 standards), it is still very popular in today’s LLM (large language model) space. It’s not the most active project: the most recent update, in March 2020, added two dozen smaller BERT models to the set.
Two popular related projects are BERTopic (Star Gain: 612), which leverages BERT and c-TF-IDF to create easily interpretable topics, and BertViz (Star Gain: 452), an interactive tool for visualizing attention in Transformer language models such as BERT, GPT-2, or T5.
Image Credit: https://github.com/jessevig/bertviz
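To make the c-TF-IDF idea mentioned above more concrete, here is a minimal sketch in plain Python. It assumes the class-based weighting tf(word, topic) × log(1 + A / f(word)), where A is the average number of words per topic and f(word) is the word's total count across all topics. It is not BERTopic's implementation, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter


def c_tf_idf(docs_by_topic):
    """Score each word per topic: tf(word, topic) * log(1 + A / f(word)).

    docs_by_topic maps a topic label to a list of tokenized documents.
    """
    # Treat all documents assigned to a topic as one long document.
    topic_counts = {
        topic: Counter(token for doc in docs for token in doc)
        for topic, docs in docs_by_topic.items()
    }
    total_counts = Counter()
    for counts in topic_counts.values():
        total_counts.update(counts)

    # A: average number of words per topic.
    avg_words = sum(total_counts.values()) / len(topic_counts)

    scores = {}
    for topic, counts in topic_counts.items():
        n_words = sum(counts.values())
        scores[topic] = {
            word: (count / n_words) * math.log(1 + avg_words / total_counts[word])
            for word, count in counts.items()
        }
    return scores


def top_words(scores, topic, k=3):
    """Return the k highest-scoring words for a topic."""
    ranked = sorted(scores[topic].items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:k]]


# Hypothetical toy corpus, already grouped into two topics.
toy = {
    "sports": [["the", "team", "won", "the", "match"], ["the", "match", "was", "close"]],
    "cooking": [["the", "soup", "needs", "salt"], ["stir", "the", "soup"]],
}
scores = c_tf_idf(toy)
print(top_words(scores, "sports"))
print(top_words(scores, "cooking"))
```

Words that are frequent inside one topic but rare across the others score highest, which is what makes the resulting topic descriptions easy to read.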
Open source machine learning framework for dialogue management | Star Gain: 832 | https://github.com/RasaHQ/rasa
Conversational assistants are a top NLP use case, and Rasa is a Python-based open-source machine learning framework for automating text- and voice-based assistants on Twilio, Slack, MS Bot, Facebook Messenger, and other channels. Rasa’s modules include NLU, which deals with natural language understanding, and Core, which handles dialogue management and leverages deep learning models such as LSTMs and reinforcement learning to predict what the assistant should do next.
Comprehensive and Easy-to-use NLP Toolkit | Star Gain: 749 | https://github.com/alibaba/EasyNLP
Just released in June of this year, this PyTorch-based NLP project has quickly attracted a following. Originally built by Alibaba in 2021, EasyNLP provides easy-to-use and concise commands to call cutting-edge models that cover a broad collection of NLP algorithms for many common real-world applications. It integrates knowledge distillation and few-shot learning for landing large pre-trained models, together with knowledge-enhanced pre-trained models such as DKPLM and KGBERT and various popular multi-modality pre-trained models. It’s yet another unified framework that includes model training, inference, and deployment.
Industrial-strength Natural Language Processing in Python | Star Gain: 742 | https://github.com/explosion/spaCy
A favorite library for any Python developer, spaCy is the go-to library for end-to-end NLP workflows. It handles not just basic NLP tasks such as tokenization, parsing, NER, tagging, and text classification but now also incorporates pre-trained transformer models such as BERT. Developing ML pipelines is an important part of today’s NLP systems, and spaCy training pipelines weave together components such as parsers, taggers, NER, and lemmatizers to help automate NLP workflows. You can easily customize your pipeline by replacing, adding, and removing components to build scalable, production-level NLP.
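The idea of a pipeline built from swappable components can be illustrated with a small plain-Python sketch. This is not spaCy's API (spaCy exposes the equivalent operations through methods such as nlp.add_pipe, nlp.remove_pipe, and nlp.replace_pipe). The Pipeline class and the three toy components below are hypothetical and exist only to show the pattern.

```python
class Pipeline:
    """An ordered list of (name, component) pairs applied to a document dict."""

    def __init__(self):
        self._components = []

    @property
    def names(self):
        return [name for name, _ in self._components]

    def add(self, name, component, before=None):
        """Append a component, or insert it before an existing one."""
        if name in self.names:
            raise ValueError(f"component {name!r} already in pipeline")
        index = self.names.index(before) if before else len(self._components)
        self._components.insert(index, (name, component))

    def remove(self, name):
        self._components.pop(self.names.index(name))

    def replace(self, name, component):
        self._components[self.names.index(name)] = (name, component)

    def __call__(self, text):
        doc = {"text": text}
        for _, component in self._components:
            doc = component(doc)
        return doc


def tokenizer(doc):
    doc["tokens"] = doc["text"].split()
    return doc


def lowercaser(doc):
    doc["tokens"] = [token.lower() for token in doc["tokens"]]
    return doc


def length_tagger(doc):
    # Toy "tagger": label tokens by length instead of part of speech.
    doc["tags"] = ["LONG" if len(token) > 4 else "SHORT" for token in doc["tokens"]]
    return doc


nlp = Pipeline()
nlp.add("tokenizer", tokenizer)
nlp.add("tagger", length_tagger)
nlp.add("lowercaser", lowercaser, before="tagger")
print(nlp.names)
print(nlp("spaCy builds NLP pipelines"))
```

Because each component only reads and writes fields on the shared document, any one of them can be removed or replaced without touching the others.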
Open source NLP framework that leverages pre-trained Transformer models | Star Gain: 627 | https://github.com/deepset-ai/haystack
To sum it up, Haystack is a Q&A framework. Per its GitHub description, “Haystack is an end-to-end framework that enables you to build powerful and production-ready pipelines for different search use cases. Whether you want to perform Question Answering or semantic document search, you can use the State-of-the-Art NLP models in Haystack to provide unique search experiences and allow your users to query in natural language. Haystack is built in a modular fashion so that you can combine the best technology from other open-source projects like Huggingface’s Transformers, Elasticsearch, or Milvus.”
A very simple framework for state-of-the-art Natural Language Processing | Star Gain: 395 | https://github.com/flairNLP/flair
Yet another PyTorch and Python library, Flair includes text classification, pre-trained named entity recognition, and part-of-speech tagging, in addition to letting you build your own models. What differentiates Flair is its simple API that wraps BERT, ELMo, and other popular models. Flair sequence tagging models, such as NER and part-of-speech tagging, are now hosted on the Hugging Face model hub. Flair is similar to spaCy but perhaps has better language support, and depending on the use case, Flair may be more suitable.
Build AI-powered semantic search applications | Star Gain: 339 | https://github.com/neuml/txtai
Thanks to improved performance, semantic search is accelerating its reach into ML workflows, and open-source projects are leading the way. Using vectors to find results that share a meaning even when the keywords differ is what txtai excels at. Built on Hugging Face Transformers and FastAPI, it provides not just model training but also workflows and pipelines that include question answering, zero-shot labeling, machine translation, language detection, and audio transcription, to name a few. Additional txtai use cases include text labeling, image search, article summarization, and entity extraction.
Image Credit: https://github.com/neuml/txtai
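The core of vector-based semantic search, ranking documents by how close their vectors are to the query's vector, can be sketched in a few lines of plain Python. This is not txtai's API. The three-dimensional "embeddings" below are hand-made, hypothetical values, whereas a real system would obtain them from a transformer model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vector, index, limit=1):
    """Return (text, score) pairs ranked by cosine similarity to the query."""
    ranked = sorted(
        ((text, cosine(query_vector, vector)) for text, vector in index.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:limit]


# Hypothetical embeddings with made-up axes: (finance, weather, sports).
index = {
    "Stocks fall as interest rates rise": [0.9, 0.1, 0.0],
    "Heavy rain expected over the weekend": [0.0, 0.95, 0.1],
    "Local team wins the championship": [0.05, 0.0, 0.9],
}

# Pretend embedding for the query "market downturn": it shares no keywords
# with the first document but points in a similar direction.
query = [0.8, 0.2, 0.1]
print(search(query, index))
```

The query matches the finance headline even though none of its words appear there, which is the behavior keyword search cannot provide.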
Topic Modelling for Humans | Star Gain: 310 | https://github.com/RaRe-Technologies/gensim
In recent years, topic modeling has expanded from simple extraction and grouping of similar words from documents to more powerful techniques. Around for over a decade, Gensim is one of the more popular Python-based unsupervised topic modeling libraries among NLP projects. Key features include the ability to easily add your own corpus and extensive implementations of popular topic modeling algorithms including online word2vec deep learning, Latent Dirichlet Allocation (LDA), Random Projections (RP), Hierarchical Dirichlet Process (HDP), and others. It’s scalable thanks to multicore implementations of many of its algorithms and can quickly and easily handle large corpora of documents.
The Natural Language Toolkit | Star Gain: 306 | https://github.com/nltk/nltk
No list of NLP projects and toolkits would be complete without a mention of NLTK, currently at version 3.7. This expansive Python toolkit also includes data sets and tutorials supporting research and development. Often compared to spaCy, and tagged as a research tool rather than a production tool, NLTK does provide more direct access (less abstraction) to NLP tasks. It’s certainly the go-to library for beginners thanks to its comprehensive basic NLP task library.
Data augmentation for NLP | Star Gain: 261 | https://github.com/makcedward/nlpaug
Thanks to Large Language Models (LLMs) and other trends in NLP, data augmentation and synthetic data generation are gaining more attention, but they are still relatively new fields and certainly new techniques for many AI practitioners. The goal of data augmentation is to increase the diversity of training data without additional data collection. nlpaug is a Python library that helps you augment text data for your machine learning projects. The library includes two key modules: Augmenter, which is the basic element of augmentation, and Flow, which is a pipeline that orchestrates multiple augmenters together. The library can generate synthetic data in a few lines of code and plays nicely with other popular frameworks including TensorFlow, PyTorch, and scikit-learn.
Image Credit: https://github.com/makcedward/nlpaug
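The Augmenter and Flow split described above can be illustrated with a small plain-Python sketch. This is not nlpaug's API, whose classes have their own names and signatures. RandomSwap, SynonymReplace, SequentialFlow, and the tiny synonym table are all hypothetical.

```python
import random


class RandomSwap:
    """Augmenter: swap one randomly chosen pair of adjacent words."""

    def __init__(self, rng):
        self.rng = rng

    def augment(self, text):
        words = text.split()
        if len(words) < 2:
            return text
        i = self.rng.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
        return " ".join(words)


class SynonymReplace:
    """Augmenter: replace words found in a small synonym table."""

    def __init__(self, synonyms):
        self.synonyms = synonyms

    def augment(self, text):
        return " ".join(self.synonyms.get(word, word) for word in text.split())


class SequentialFlow:
    """Flow: apply each augmenter in order to the previous result."""

    def __init__(self, augmenters):
        self.augmenters = augmenters

    def augment(self, text):
        for augmenter in self.augmenters:
            text = augmenter.augment(text)
        return text


flow = SequentialFlow([
    SynonymReplace({"quick": "fast", "happy": "glad"}),
    RandomSwap(random.Random(0)),  # seeded so runs are repeatable
])
print(flow.augment("the quick dog looks happy"))
```

Each augmenter makes one small, label-preserving change, and the flow chains them so that a single original sentence can yield many training variants.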
Learn more about NLP and NLP Projects at ODSC West 2022
There’s a lot to learn about these trending NLP projects, how to use NLP, and how to implement NLP into a business. By attending ODSC West 2022 this November 1st-3rd, and checking out the NLP Track, you can learn how to do all of that with expert-led talks, training sessions, and workshops. Here are a few sessions that you can attend there.
- Self-Supervised and Unsupervised Learning for Conversational AI and NLP
- Building Modern Search Pipelines with Haystack, Large Language Models, and Hybrid Retrieval
- Bagging to BERT – A Tour of Applied NLP
- Applications of NLP in Retail/E-commerce
- Hyper-productive NLP with Hugging Face Transformers
- The Next Thousand Languages
- Transforming The Retail Industry with Transformers