NLP was one of the hottest skills in 2019 and 2020, and for good reason: companies have a lot of text to work with and many opportunities to apply NLP across the business. We will discuss the top applications of NLP in part II of this two-part blog series, but first we will focus on the top NLP skills for 2021: the languages, tools, and frameworks most commonly used in NLP.
To construct our NLP skills list, we’ve combed through thousands of NLP job postings from the last year to see what was in demand and what that means for the year ahead. Obviously, data science and machine learning got top mentions for these roles, but we wanted to dive a little deeper. Surprisingly, given the major breakthroughs in NLP, especially around transfer learning, many of the NLP skills requested involved more established methods. Here’s our list, in no particular order.
1. fastText
Judging from the job boards, the open-source fastText library and its pretrained models are popular with many companies. fastText is a word embedding method and an extension of the word2vec model. Although deep neural network methods for NLP are now popular, they can be slow to train and test. fastText helps solve this problem by employing a hierarchical classifier instead of a flat one, and it can be orders of magnitude faster, especially when you have many categories.
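A key idea behind fastText is that each word vector is built from the word's character n-grams, which is how it handles rare and unseen words. Here is a minimal pure-Python sketch of that n-gram extraction (a toy illustration of the idea, not the fastText implementation; the function name and defaults are ours):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Extract the character n-grams fastText uses to represent a word.

    The word is wrapped in boundary markers '<' and '>' so that prefixes
    and suffixes are distinguishable from word-internal n-grams.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

# The word's vector is then the sum of the embeddings of its n-grams,
# which lets fastText produce vectors even for out-of-vocabulary words.
print(char_ngrams("where", n_min=3, n_max=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Note how `her` (from "where") would be shared with the standalone word "her", while `<wh` is unambiguously a prefix.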
2. PyTorch
The RNN is one of the most widely used neural network architectures for NLP, given its ability to deal with sequential data. RNNs, LSTMs, and related models are available as recurrent layer classes in PyTorch. Not surprisingly, many companies choose PyTorch primarily for its tight integration with the Python language and an API that is friendlier and easier to use than those of other deep learning frameworks.
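What makes an RNN suited to sequential data is its recurrence: each hidden state is computed from the current input and the previous hidden state. Here is a toy, scalar-valued sketch of that recurrence in plain Python (weights are arbitrary illustrative values, not a trained model):

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One step of a vanilla RNN with scalar state:
    h_t = tanh(w_x * x_t + w_h * h_prev + b)."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def run_rnn(sequence, w_x=0.5, w_h=0.8, b=0.0):
    """Fold a sequence through the recurrence, starting from h_0 = 0.
    Each hidden state carries a summary of everything seen so far."""
    h = 0.0
    states = []
    for x_t in sequence:
        h = rnn_step(x_t, h, w_x, w_h, b)
        states.append(h)
    return states

states = run_rnn([1.0, 0.0, 1.0])
```

In PyTorch, this recurrence (with vector-valued states and learned weights) is what `torch.nn.RNN` implements, with `torch.nn.LSTM` and `torch.nn.GRU` adding gating to better preserve long-range context.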
3. spaCy
Released in 2015, spaCy was initially created to help small businesses better leverage NLP. Its practical design offers users a streamlined approach for accomplishing necessary NLP tasks. spaCy can be quite flexible, as it allows more experienced users the option of customizing just about any of its tools.
4. AdaptNLP
NLP has seen frequent updates and advancements over the past few years, thanks largely to the increasing amount of text-based data. Tools like BERT, GPT-2, and Hugging Face’s Transformers library have helped build these newer models, though it can be difficult to get started with them. Novetta’s open-source AdaptNLP framework makes those advanced tools easier and faster to use, allowing users to apply fine-tuned, pre-trained language models to text classification, question answering, entity extraction, and part-of-speech tagging.
5. Spark NLP
Spark NLP is an open-source natural language processing library built on top of Apache Spark and Spark ML. It provides an easy API that integrates with ML pipelines, and it is commercially supported by John Snow Labs. Spark NLP’s annotators use rule-based algorithms and machine learning, with some running TensorFlow under the hood to power specific deep learning implementations. The library covers many common NLP tasks, including tokenization, stemming, lemmatization, part-of-speech tagging, sentiment analysis, spell checking, named entity recognition, and more.
6. BERT
If you’re familiar with NLP at all, then you likely heard a lot about BERT (Bidirectional Encoder Representations from Transformers) in late 2019 and early 2020. This tool, developed by Google, allows users to create their own question-answering models, among other things, with relative ease and speed. The main selling point of BERT is that it helps Google better understand the nuances and context of words in searches and match those queries with more relevant results. This made waves not just in the data science community; it also changed how SEO practitioners look at featured snippets and search results.
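Before any text reaches BERT, it is split into WordPiece subword tokens, where continuation pieces carry a `##` prefix. Below is a minimal greedy longest-match-first sketch of that tokenization with a tiny made-up vocabulary (BERT's real vocabulary has roughly 30,000 entries; this is an illustration of the algorithm, not the production tokenizer):

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first subword tokenization, WordPiece-style.
    Continuation pieces carry a '##' prefix; unknown words map to [UNK]."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark word-internal pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched; the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary for illustration only.
vocab = {"play", "##ing", "##ed", "un", "##affable"}
print(wordpiece("playing", vocab))  # ['play', '##ing']
```

Subword tokenization is a big part of why BERT handles rare words and morphology gracefully: "playing" and "played" share the piece `play`.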
7. Hugging Face
Hugging Face is a company creating open-source libraries for powerful yet easy-to-use NLP, such as its Tokenizers and Transformers libraries. The Hugging Face Transformers library provides general-purpose architectures, like BERT, GPT-2, RoBERTa, XLM, DistilBERT, XLNet, and T5, for Natural Language Understanding (NLU) and Natural Language Generation (NLG). It currently includes thousands of pre-trained models in 100+ languages. These models are easy to use, powerful, and performant for many NLP tasks, and model training, evaluation, and sharing can be achieved in a few lines of code.
8. CoreNLP
Stanford’s CoreNLP enables users to derive linguistic annotations for text, including token and sentence boundaries, parts of speech, named entities, numeric and time values, dependency and constituency parses, coreference, sentiment, quote attributions, and relations.
9. GPT-3
Generative Pre-trained Transformer 3 (GPT-3) is a language model that uses deep learning to produce human-like text. GPT-3 is the most recent language model from the OpenAI research lab, announced in a May 2020 research paper, “Language Models are Few-Shot Learners.” While a tool like this may not be something you use daily as an NLP professional, it’s still worth knowing: a model that can produce human-like text, answer questions, and even generate code is a capability worth having in your toolkit.
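The "few-shot learner" idea from the paper title means that instead of fine-tuning, you show GPT-3 a handful of worked examples directly in the prompt and let it complete the pattern. Here is a small sketch of assembling such a prompt (the helper name and label format are ours; the English-to-French examples follow the style used in the paper):

```python
def few_shot_prompt(examples, query, input_label="Input", output_label="Output"):
    """Assemble a few-shot prompt: a handful of worked examples followed
    by the new query, leaving the model to complete the final output."""
    lines = []
    for inp, out in examples:
        lines.append(f"{input_label}: {inp}")
        lines.append(f"{output_label}: {out}")
    lines.append(f"{input_label}: {query}")
    lines.append(f"{output_label}:")  # the model fills in what follows
    return "\n".join(lines)

prompt = few_shot_prompt(
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "peppermint",
    input_label="English",
    output_label="French",
)
print(prompt)
```

The resulting string would then be sent to the model as-is; no gradient updates are involved, which is what makes the approach "few-shot" rather than fine-tuned.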
10. seq2seq
Short for sequence-to-sequence learning, seq2seq is a general-purpose encoder-decoder framework for TensorFlow that can be used for machine translation, text summarization, conversational modeling, image captioning, and more. Developed by Google, seq2seq’s main strength is translating a sentence from one language to another.
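In an encoder-decoder model, the encoder compresses the source sentence into a context, and the decoder then emits output tokens one at a time until it produces an end-of-sequence marker. The decoding loop itself is simple; here is a toy sketch where a lookup table stands in for a trained decoder (everything here is illustrative, not the seq2seq framework's API):

```python
def greedy_decode(context, next_token, max_len=10, eos="<eos>"):
    """Generic greedy decoding loop: repeatedly ask the model for the
    most likely next token given the encoder context and the tokens
    generated so far, stopping at the end-of-sequence marker."""
    output = []
    for _ in range(max_len):
        token = next_token(context, output)
        if token == eos:
            break
        output.append(token)
    return output

# Toy stand-in for a trained decoder: "translate" word by word via a table.
TABLE = {"hello": "bonjour", "world": "monde"}

def toy_next_token(context, generated):
    if len(generated) < len(context):
        return TABLE.get(context[len(generated)], "<unk>")
    return "<eos>"

print(greedy_decode(["hello", "world"], toy_next_token))  # ['bonjour', 'monde']
```

A real decoder conditions each step on learned representations rather than a table, and production systems typically replace the greedy choice with beam search, but the loop structure is the same.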
11. Alexa API
Given how many people use smart home devices like Amazon’s Alexa, it’s good to know how to develop for them. Being able to build for Alexa, whether question-answering models, voice-driven purchases, or other skills, is a strong skill to have.
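At its core, an Alexa skill backend receives a JSON request and returns a JSON response telling the device what to say. A minimal sketch of building such a response is below (simplified to the shape of the Alexa Skills Kit response format; real skills typically add cards, reprompts, and session attributes, and the helper function name is ours):

```python
import json

def alexa_response(text, end_session=True):
    """Build a minimal Alexa Skills Kit JSON response that speaks `text`."""
    return {
        "version": "1.0",
        "response": {
            # PlainText speech; SSML is also supported for richer output.
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": end_session,
        },
    }

body = alexa_response("The answer is forty-two.")
print(json.dumps(body, indent=2))
```

Setting `shouldEndSession` to `False` keeps the microphone open for a follow-up, which is how multi-turn question-and-answer skills work.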
12. NLTK
The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing for English, written in the Python programming language. With NLTK, we can look up a word’s meaning using a built-in lexical database called WordNet. WordNet presents nouns, verbs, adjectives, and adverbs grouped into sets of cognitive synonyms — synsets — with each synset representing a distinct concept.
13. Scikit-learn
Scikit-learn has been around for quite a while and is widely used by in-house data science teams, so it’s not surprising that it serves as a platform not only for training and testing NLP models but also for full NLP and NLU workflows. In addition to working well with many of the libraries already mentioned, such as NLTK, it has its own extensive library of models. Many NLP and NLU projects involve the classic workflow of feature extraction, training, testing, and evaluation, and scikit-learn’s pipeline module fits this purpose well.
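That classic feature-extraction-then-train workflow maps directly onto scikit-learn's `Pipeline`. A short sketch with a tiny made-up corpus (illustrative data only; a real project would load a labeled dataset and hold out a test set):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny illustrative corpus: 1 = positive review, 0 = negative review.
texts = [
    "great movie, loved it", "fantastic acting and plot",
    "wonderful film", "terrible movie, hated it",
    "boring and awful", "worst film ever",
]
labels = [1, 1, 1, 0, 0, 0]

# Feature extraction and classification chained into a single estimator,
# so fit/predict runs the whole workflow end to end.
clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("model", LogisticRegression()),
])
clf.fit(texts, labels)
print(clf.predict(["loved the acting", "awful plot"]))
```

Because the vectorizer lives inside the pipeline, the same transformation is applied at training and prediction time, which avoids a common source of leakage and mismatch bugs.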
14. Python
Last but certainly not least: you may have noticed that these libraries and frameworks have one thing in common, namely that most are implemented in Python. R, Julia, and even Java may also be in demand, but companies building NLP and NLU systems strongly prefer Python. Many of the tools and frameworks above integrate tightly with Python, and it will continue to be the language of choice for NLP in 2021.
Learn NLP Skills with Ai+
The ODSC on-demand training platform, Ai+ Training, offers a number of videos that will help you get up-to-date on the latest NLP skills, tricks, tools, platforms, libraries, and research advancements. Here are a few standout talks:
An Introduction to Transfer Learning in NLP and HuggingFace Tools: Thomas Wolf, PhD | Chief Science Officer | Hugging Face
Natural Language Processing Case-studies for Healthcare Models: Veysel Kocaman | Lead Data Scientist and ML Engineer | John Snow Labs
Transform your NLP Skills Using BERT (and Transformers) in Real Life: Niels Kasch, PhD | Data Scientist and Founding Partner | Miner & Kasch
A Gentle Intro to Transformer Neural Networks: Jay Alammar | Machine Learning Research Engineer | jalammar.github.io
Level Up: Fancy NLP with Straightforward Tools: Kimberly Fessel, PhD | Senior Data Scientist, Instructor | Metis
Build an ML pipeline for BERT models with TensorFlow Extended – An end-to-end Tutorial: Hannes Hapke | Senior Machine Learning Engineer | SAP Concur
Natural Language Processing: Feature Engineering in the Context of Stock Investing: Frank Zhao | Senior Director, Quantamental Research | S&P Global
Transfer Learning in NLP: Joan Xiao, PhD | Principal Data Scientist | Linc Global
Developing Natural Language Processing Pipelines for Industry: Michael Luk, PhD | Chief Technology Officer | SFL Scientific
Deep Learning-Driven Text Summarization & Explainability: Nina Hristozova | Junior Data Scientist | Thomson Reuters