Natural Language Processing hit its big stride in 2017 with the introduction of the Transformer architecture from Google. State-of-the-art approaches have helped bridge the gap between humans and machines, letting us build systems that use human language convincingly. It’s an exciting time. You’re going to need some frameworks for NLP to continue that innovation, and here are some of the best ones around.
PyTorch is an open-source machine and deep learning library based on Torch. It’s often used for NLP and integrates with Facebook AI’s RoBERTa project. It’s fast and flexible, offers GPU acceleration, and supports RNNs for tasks like classification, tagging, and text generation.
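As a sketch of what an RNN for text classification looks like in PyTorch, here is a minimal LSTM-based classifier; the vocabulary size, dimensions, and random input batch are all illustrative choices, not anything prescribed by the library.

```python
import torch
import torch.nn as nn

class RNNClassifier(nn.Module):
    """Minimal LSTM text classifier: token ids -> class logits."""
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embed(token_ids)        # (batch, seq, embed_dim)
        _, (hidden, _) = self.rnn(embedded)     # final hidden state
        return self.fc(hidden[-1])              # (batch, num_classes)

model = RNNClassifier(vocab_size=1000, embed_dim=32, hidden_dim=64, num_classes=2)
batch = torch.randint(0, 1000, (4, 12))         # 4 sequences of 12 token ids
logits = model(batch)
print(logits.shape)                             # torch.Size([4, 2])
```

In a real project you would feed integer-encoded text from a tokenizer instead of random ids, but the model definition follows this same pattern.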
SpaCy is fast and agile. It’s designed to make cutting-edge NLP practical and accessible, and it works with other well-known libraries like Gensim and scikit-learn. Written in Python and Cython, it’s optimized for performance and gives developers a natural path to more advanced NLP tasks like named entity recognition.
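To give a feel for the API, here is a small sketch using a blank English pipeline, which tokenizes out of the box. Named entity recognition requires a trained pipeline such as `en_core_web_sm`, which has to be downloaded separately, so that part is shown only as a comment.

```python
import spacy

# A blank English pipeline handles tokenization with no downloads.
nlp = spacy.blank("en")
doc = nlp("Apple is opening an office in London.")
print([token.text for token in doc])

# Named entity recognition needs a trained pipeline, installed via
# `python -m spacy download en_core_web_sm`:
# nlp = spacy.load("en_core_web_sm")
# doc = nlp("Apple is opening an office in London.")
# print([(ent.text, ent.label_) for ent in doc.ents])
```

The same `doc` object exposes tokens, entities, part-of-speech tags, and dependency parses, which is part of what makes spaCy feel so natural to work with.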
Facebook AI XLM/mBERT
XLM is Facebook’s multilingual language model, bringing new training data sets, including those from low-resource languages, to the table. If you’ve been working with languages other than English but lack proper data sets, this could be your answer.
XLM-R achieved the best results to date on four cross-lingual benchmarks, becoming the first multilingual model to outperform traditional monolingual baselines. It performs particularly well for low-resource languages like Urdu or Swahili.
Otherwise known as “Enhanced Representation through kNowledge IntEgration,” ERNIE is Baidu’s state-of-the-art NLU framework offering pre-trained models that outperformed BERT in both English and Chinese. It uses continual pretraining and is word-aware, structure-aware, and semantics-aware.
TensorFlow remains one of the most popular frameworks for machine and deep learning, but you can translate that power to NLP tasks. Its most famous application, Google Translate, may spawn numerous jokes in the language learning world, but the fact that it gets close is an impressive feat. TensorFlow offers flexible, production-scale architecture capable of running on CPUs or GPUs.
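One common NLP building block in modern TensorFlow is turning raw strings into integer token ids. Here is a minimal sketch using the built-in `TextVectorization` layer (available as `tf.keras.layers.TextVectorization` in recent TensorFlow 2.x releases); the corpus and sizes are illustrative.

```python
import tensorflow as tf

# Learn a vocabulary from a toy corpus, then map text to padded id sequences.
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=100, output_sequence_length=6
)
corpus = ["the cat sat on the mat", "the dog barked"]
vectorizer.adapt(corpus)

ids = vectorizer(["the cat barked"])
print(ids.numpy())   # shape (1, 6): token ids, zero-padded
```

The resulting layer can be dropped directly into a model, so the same preprocessing runs identically in training and in production serving.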
Stanford’s generalized tool provides APIs for a number of common programming languages and can run as a simple web service. You can perform sentiment analysis, bootstrapped pattern learning, and named entity recognition across 53 languages with its neural models, in addition to a whole suite of other common NLP tasks. It has intuitive syntax but may not be quite as customizable as some alternatives.
Keras runs on CPU or GPU, making it a suitable high-level API for deep learning. It’s Python-based, so it’s accessible for a variety of data scientists working in the field (hello, Python ecosystem!). It’s compatible with both convolutional and recurrent neural networks, and you can run it on top of CNTK, Theano, and of course, TensorFlow. Keras focuses on rapid iteration, enabling users to run experiments efficiently. Plus, you can build models for all the usual NLP tasks, including parsing, machine translation, and classification.
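The rapid-iteration style looks something like this minimal sketch of a binary text classifier over pre-tokenized integer sequences; the layer sizes and random training data are illustrative placeholders for a real dataset.

```python
import numpy as np
from tensorflow import keras

# Embed token ids, average them, and classify -- a common Keras baseline.
model = keras.Sequential([
    keras.layers.Embedding(input_dim=1000, output_dim=16),
    keras.layers.GlobalAveragePooling1D(),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

x = np.random.randint(0, 1000, size=(8, 20))   # 8 sequences of 20 token ids
y = np.random.randint(0, 2, size=(8,))         # toy binary labels
model.fit(x, y, epochs=1, verbose=0)
preds = model.predict(x, verbose=0)
print(preds.shape)                             # (8, 1)
```

Swapping the pooling layer for an LSTM or a convolutional stack is a one-line change, which is exactly the kind of experimentation Keras is built for.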
Chainer belongs to the Python ecosystem and is a standalone framework for deep learning. It comes in handy for RNNLM (recurrent neural network language models) and modeling sequential data, such as sentences in natural language. It’s great with variable-length inputs and is highly accessible thanks to its Python foundation.
Gensim was explicitly designed for unsupervised topic modeling and document similarity analysis. It’s a workhorse for NLP, handling raw, unstructured text like a champ. The Gensim Word2Vec model helps with things like word embeddings or processing academic documents, and it’s highly scalable for a variety of solutions.
While it’s not a general-purpose framework, if you’re working with its specific use cases, it’s a game-changer.
Scikit-learn is an excellent framework for implementing things like regression and classification models. People often use it for classifying news publications, for example, or even working with tweets. It’s highly beginner-friendly and well-documented, allowing those just starting in the field to get up and running quickly.
Scikit-Learn may not be the best option for higher-order NLP processes. Still, it’s an essential option for intuitive classification models, and it provides a baseline of ML algorithms to get started on a few different projects.
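A baseline text classifier of the news-classification kind mentioned above fits in a few lines; the tiny labeled corpus here is purely illustrative.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data: a handful of labeled headlines.
texts = [
    "the match ended in a dramatic penalty shootout",
    "the striker scored twice in the final",
    "the senate passed the new budget bill",
    "the president signed the trade agreement",
]
labels = ["sports", "sports", "politics", "politics"]

# TF-IDF features feeding a Naive Bayes classifier.
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(["the goalkeeper saved a penalty"]))
```

Swapping `MultinomialNB` for `LogisticRegression` or `LinearSVC` is a one-line change, which makes pipelines like this a convenient baseline before reaching for deep learning.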
Getting Started with Frameworks for NLP Projects
Language is one of the great untapped resources of information. Now that we’ve begun to understand how to build programs that access this raw data, we’re able to put it to work. There are so many options out there for accessing and processing language-related information and unstructured data, but these should provide the best start for whatever project you’re doing.
Frameworks for NLP at ODSC East 2020
Hands-on training with these frameworks for NLP is a great way to become proficient in a new skill in the shortest amount of time. At ODSC East 2020, our NLP track features multiple talks on NLP to help get you up to speed.
Spark NLP is a popular library as it’s fully scalable and can be used with multiple languages. At ODSC East 2020, leading NLP researcher David Talby of Pacific AI will discuss “State of the Art Natural Language Processing at Scale” to help teach you about Spark and how to scale your NLP initiatives.
Hugging Face is the most widely used transformer library for NLP. Thomas Wolf, the Chief Science Officer at Hugging Face, will give a primer on the best ways to use it. BERT, RoBERTa, and GPT-2 have been making waves in 2019 and 2020 as popular pretraining methods for NLP, and in the talk “Transform your NLP Skills: Using BERT (and Transformers) in Real Life” with Niels Kasch, you’ll learn everything you need to know about getting started with these popular tools.