Build NLP and Conversational AI Apps with Transformers and Large Scale Pre-Trained Language Models

Transformers have taken the AI research and product community by storm. We have seen them advancing multiple fields in AI such as natural language processing (NLP), computer vision, and robotics. In this blog, I will share some background in conversational AI, NLP, and transformers-based large-scale language models such as BERT and GPT-3 followed by some examples around popular applications and how to build NLP apps.

Natural Language Processing and Conversational AI

Conversational AI involves technologies that let machines interact with humans (or other machines) in a natural and meaningful way. An interaction can be goal-oriented (e.g. “search movie for a weekend”) or non-goal-oriented (e.g. social conversations), and can be based on speech, text, or sign language. Building NLP apps and conversational AI systems can involve several tasks such as speech processing, language understanding, dialog management, and language generation. It therefore draws on several technologies such as NLP, audio processing, and machine learning.

Conversational AI Leveraging NLP, ML — Created by Infopulse Shah (Medium, 2019)

Evolution of Conversational AI

Conversational AI has been transforming industries and applications such as automation, contact centers, and virtual assistants. The field has gone through several phases of research and development. Prior to the 1990s, most systems were purely rule-based. Machine learning-based systems followed, but application-specific featurization of data and the handling of multiple domains and scenarios remained hard. To address these challenges, “Word-Embedding” based models were built in NLP, and “Skills-based” and “Domain-Intent-Slot” based systems were proposed in Conversational AI. Post-2013, transfer learning and deep learning-based systems improved performance substantially and scaled to millions of users across a variety of applications. Despite significant progress in the past decade, most systems still rely on large amounts of data annotation for language understanding, configurations for dialog management, and templates for language generation. Within the last two years, transformer-based models have demonstrated the power of unsupervised learning and generative systems across all aspects of conversational AI: speech recognition, language understanding, dialog management, and language generation.

Conversational AI Architecture — Created by Nisar Shah (Medium, 2018)
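To make the “Domain-Intent-Slot” idea and the understanding, dialog management, and generation stages concrete, here is a minimal sketch of a rule-based pipeline. Everything in it (the `Parse` class, the keyword rules, the action names, and the templates) is hypothetical and chosen only for illustration; a production system would replace each stage with trained models.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Parse:
    """Output of language understanding: a domain, an intent, and slot values."""
    domain: str
    intent: str
    slots: dict = field(default_factory=dict)


def understand(utterance: str) -> Parse:
    """Toy rule-based NLU: keyword rules stand in for a trained model."""
    text = utterance.lower()
    if "movie" in text:
        slots = {}
        match = re.search(r"\b(tonight|tomorrow|weekend)\b", text)
        if match:
            slots["date"] = match.group(1)
        return Parse(domain="movies", intent="search_movie", slots=slots)
    return Parse(domain="chitchat", intent="social")


def decide(parse: Parse) -> str:
    """Toy dialog manager: pick the next system action from the parse."""
    if parse.intent == "search_movie":
        return "show_results" if "date" in parse.slots else "ask_date"
    return "small_talk"


# Hypothetical response templates, one per system action.
TEMPLATES = {
    "show_results": "Here are some movies for the {date}.",
    "ask_date": "When would you like to watch a movie?",
    "small_talk": "Happy to chat! What is on your mind?",
}


def generate(action: str, parse: Parse) -> str:
    """Template-based language generation: fill the template with slot values."""
    return TEMPLATES[action].format(**parse.slots)


def respond(utterance: str) -> str:
    """Run the three stages in order: understand, decide, generate."""
    parse = understand(utterance)
    return generate(decide(parse), parse)


print(respond("search movie for a weekend"))
print(respond("find me a movie"))
```

The hand-written rules, configuration, and templates in this sketch are exactly the parts that need manual effort for every new domain, which is the limitation the paragraph above describes.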

Language Models (LMs) and Transformers based Pre-trained Large Scale LMs

A language model (LM) is a probability distribution over sequences of words. In simple terms, LMs learn which word sequences are likely, along with representations of the words. Since we communicate through words, LMs learn the distribution of words for a given language, set of languages, or context. That is, a good LM for a given language can be seen as a representation of the language itself. Since LMs are trained in a self-supervised manner, i.e. by just observing sequences of words without being told what the words mean, they might not know the meaning of the words. What they actually learn is the placement of words given some context.

Language Model Illustration — Source Chauhan Jainish (Medium, 2019)
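As a rough illustration of “a probability distribution over sequences of words”, the sketch below builds a tiny bigram language model from three made-up sentences. The corpus and the helper names (`next_word_prob`, `sentence_prob`) are assumptions for this example; real LMs such as BERT and GPT-3 use neural networks rather than counts, but the underlying idea of scoring a word given its context is the same.

```python
from collections import Counter, defaultdict

# A made-up toy corpus; real models are trained on billions of sentences.
corpus = [
    "i like movies",
    "i like music",
    "i watch movies",
]

# Count how often each word follows the previous one.
# "<s>" and "</s>" mark the start and end of a sentence.
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[prev][word] += 1


def next_word_prob(prev: str, word: str) -> float:
    """P(word | prev), estimated by relative frequency in the corpus."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][word] / total if total else 0.0


def sentence_prob(sentence: str) -> float:
    """Probability of a whole sentence via the chain rule over bigrams."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= next_word_prob(prev, word)
    return prob


print(sentence_prob("i like movies"))
print(sentence_prob("movies like i"))
```

With this corpus the model assigns a probability of about one third to “i like movies” and zero to “movies like i”: it has learned word placement from the data without any notion of what the words mean.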

LMs are of great importance for Conversational AI tasks and for building NLP apps. Once LMs are built or trained, they can be used for a variety of applications by simply fine-tuning them on a given task or dataset. Large Scale Pre-trained LMs such as BERT and GPT-3 are based on the same concept, but building them requires training on massive amounts of data (billions of sentences), and the resulting models have hundreds of millions (BERT), hundreds of billions (GPT-3), or even trillions (Switch Transformers) of parameters. These models are so big that they come close to memorizing large parts of their training text and its context, and are therefore great at generating text. Applications that involve sequence generation, such as Music Generation, Story Generation, and Response Generation in Conversational AI systems, have seen a dramatic improvement with these LMs.

Pre-trained Large Scale Language Model Size — Source Search Engine Watch
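To get a feel for these sizes, the back-of-the-envelope sketch below estimates how much memory the weights alone would occupy. The parameter counts are the approximate publicly reported figures, and the helper `weights_gb` and the assumption of 4 bytes per parameter (32-bit floats) are mine; actual memory use during training is considerably higher because of gradients, optimizer state, and activations.

```python
# Approximate parameter counts as publicly reported for each model.
PARAM_COUNTS = {
    "BERT-base": 110e6,
    "BERT-large": 340e6,
    "GPT-3": 175e9,
    "Switch Transformer (largest)": 1.6e12,
}


def weights_gb(num_params: float, bytes_per_param: int = 4) -> float:
    """Size of the weights in gigabytes (1 GB = 1e9 bytes), 4 bytes per parameter by default."""
    return num_params * bytes_per_param / 1e9


for name, num_params in PARAM_COUNTS.items():
    print(f"{name}: about {weights_gb(num_params):,.2f} GB of weights")
```

Under these assumptions BERT-base fits comfortably on a single consumer GPU, while the weights of GPT-3 alone would need hundreds of gigabytes, which is why most developers fine-tune the smaller models or use the largest ones through hosted services.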

How to use Pre-trained Language Models in NLP and Conversational AI Applications?

Hugging Face, through its open-source libraries, together with tools like Google Colab, has made it really easy for developers and researchers to leverage Pre-trained Large Scale LMs with just a few lines of code. The open-source nature of such projects has dramatically accelerated the pace of research and development. Along with optimizing and scaling, trying new ideas is becoming really easy. A developer just needs to identify the task they are interested in (e.g. text classification, question answering, entity recognition, etc.) and collect the corresponding data. For each NLP and Conversational AI task, a catalog of pre-trained models exists across a variety of languages, and a developer can use these models directly or fine-tune them further on their own task.

Catalog of Pre-trained Large Scale Language Models and Datasets — Source Hugging Face


As mentioned above, Conversational AI and NLP involve a variety of tasks such as Text-Classification, Summarization, Text Generation, Translation, and Question-Answering. Any of these tasks can be easily invoked with Hugging Face transformers. To demonstrate this, let’s install the Hugging Face “Transformers” and “Datasets” libraries:

pip install transformers
pip install datasets 

Once installed, let’s follow these steps:

1. Identify the task: You can start with any of the tasks listed above. Hugging Face supports a variety of models. Let’s choose the text classification task using AutoModelForSequenceClassification. Let’s also choose the “BERT” model.

2. Identify the model, config, and tokenizers: You can also choose a model using AutoClasses. You may also optionally use the config related to your task. Hugging Face has a catalog of config which you can start with:

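from transformers import AutoConfig, AutoTokenizer, AutoModelForSequenceClassification

# num_labels=2 because IMDb sentiment is binary (negative/positive)
config = AutoConfig.from_pretrained("bert-base-uncased", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# from_pretrained loads the pre-trained BERT weights; from_config would start from random weights
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", config=config)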

3. Get and prepare your data: “Datasets” python library from Hugging Face has thousands of datasets. You can start from that or use your own using this library. Let’s take an example of IMDb sentiment classification dataset. We can either load the data directly using the “Datasets” library or build from scratch. Let’s first download it and then prepare it using these two approaches:

#Using datasets library:

from datasets import load_dataset
dataset = load_dataset("imdb")

# Or load the data from scratch. Run these two commands in a shell
# (in a notebook, prefix each with "!"):

wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar -xf aclImdb_v1.tar.gz

from pathlib import Path

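def read_imdb_split(split_dir):
    """Read reviews under split_dir/pos and split_dir/neg; return texts and 0/1 labels."""
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ["pos", "neg"]:
        for text_file in (split_dir/label_dir).iterdir():
            texts.append(text_file.read_text(encoding="utf-8"))
            labels.append(0 if label_dir == "neg" else 1)
    return texts, labels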

# This could be a large dataset and your machine/gpu can go out of memory. You can sample from this dataset and experiment with a smaller slice.

train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')

# Hold out 10% of the training data as a validation set for monitoring the model during training

from sklearn.model_selection import train_test_split
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.1)
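If the full dataset is too large for your machine, as the comment above warns, one option is to experiment on a random slice first. The helper below is a hypothetical sketch (the name `sample_subset` and the stand-in data are not part of the original walkthrough); it draws the same random indices for texts and labels so the two lists stay aligned.

```python
import random


def sample_subset(texts, labels, n, seed=0):
    """Return a random subset of n (text, label) pairs, keeping them aligned."""
    n = min(n, len(texts))
    rng = random.Random(seed)  # fixed seed so the slice is reproducible
    indices = rng.sample(range(len(texts)), n)
    return [texts[i] for i in indices], [labels[i] for i in indices]


# Stand-in data so the sketch runs on its own; in the walkthrough these
# would be train_texts and train_labels.
demo_texts = [f"review {i}" for i in range(10)]
demo_labels = [i % 2 for i in range(10)]

small_texts, small_labels = sample_subset(demo_texts, demo_labels, n=4)
print(small_texts, small_labels)
```

In the walkthrough you could call something like `sample_subset(train_texts, train_labels, n=2000)` before tokenizing, then switch back to the full data once the pipeline runs end to end.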


4. Obtain encodings using Tokenizer and Tensor: Machines do not understand raw text, and neural networks such as Transformers take numerical input. Therefore, text data first needs to be tokenized into smaller units and then encoded as numerical token IDs, which the model maps to rich embeddings internally. Hence, we transform the text data using the model’s tokenizer and then wrap the result in tensors to be consumed by the model:

train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)

import torch

class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)
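To show what `truncation=True` and `padding=True` do conceptually, here is a toy whitespace tokenizer. It is a deliberately simplified stand-in, not the BERT tokenizer: the real one splits text into WordPiece sub-words and adds special tokens such as `[CLS]` and `[SEP]`. The helper names `build_vocab` and `encode` are assumptions made for this sketch.

```python
def build_vocab(texts):
    """Assign an integer ID to every word; 0 and 1 are reserved for padding and unknowns."""
    vocab = {"[PAD]": 0, "[UNK]": 1}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(texts, vocab, max_length):
    """Map words to IDs, truncate to max_length, and pad shorter sequences."""
    input_ids, attention_mask = [], []
    for text in texts:
        ids = [vocab.get(w, vocab["[UNK]"]) for w in text.lower().split()]
        ids = ids[:max_length]                              # truncation
        pad = max_length - len(ids)
        input_ids.append(ids + [vocab["[PAD]"]] * pad)      # padding
        attention_mask.append([1] * len(ids) + [0] * pad)   # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


vocab = build_vocab(["a great movie", "a dull plot"])
batch = encode(["a great movie", "a truly dull and slow plot"], vocab, max_length=4)
print(batch["input_ids"])
print(batch["attention_mask"])
```

The short review is padded with zeros and the long one is cut off at four tokens, so every row has the same length and can be stacked into a tensor. The attention mask tells the model which positions hold real tokens.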

5. Train the model using the Trainer class: Now we are all set. We can simply train and evaluate the model using Hugging Face Trainer Class.

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
)

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset             # evaluation dataset
)

trainer.train()                          # run fine-tuning
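The `warmup_steps` argument may be the least obvious setting above. As a rough sketch, assuming the linear schedule and the base learning rate of 5e-5 that, to my knowledge, are the Trainer defaults, the learning rate ramps up linearly during warmup and then decays linearly to zero. The function name `linear_warmup_lr` and the step counts below are illustrative assumptions, not part of the library.

```python
import math


def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given step: linear ramp-up, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Assumed numbers: 25,000 IMDb training reviews minus the 10% validation split,
# batch size 16 on a single device, 3 epochs, and a base rate of 5e-5.
steps_per_epoch = math.ceil(22_500 / 16)
total_steps = steps_per_epoch * 3

for step in (0, 250, 500, total_steps):
    print(step, linear_warmup_lr(step, 5e-5, 500, total_steps))
```

Under these assumptions training runs for 4,221 optimizer steps, so 500 warmup steps cover roughly the first 12% of training. If you train on a much smaller slice of the data, it may be worth reducing `warmup_steps` accordingly.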


6. Evaluate the model on the validation and test datasets:

eval_results = trainer.evaluate()
print("Evaluation Results: ", eval_results)

test_results = trainer.evaluate(eval_dataset=test_dataset)
print("Test Results: ", test_results)
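Without further configuration, the results above mostly report the loss. To also report accuracy, the Trainer accepts a `compute_metrics` callback that receives the model's logits and the true labels. The function below is a minimal sketch written in plain Python so it runs on its own; the stand-in logits and labels are made up for the demonstration.

```python
def compute_metrics(eval_pred):
    """Accuracy from (logits, labels); each row of logits has one score per class."""
    logits, labels = eval_pred
    correct = 0
    for row, label in zip(logits, labels):
        # The predicted class is the index of the largest score in the row.
        predicted = max(range(len(row)), key=lambda i: row[i])
        correct += int(predicted == int(label))
    return {"accuracy": correct / len(labels)}


# Stand-in logits and labels so the sketch runs without a trained model.
demo_logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4]]
demo_labels = [1, 0, 0, 0]
print(compute_metrics((demo_logits, demo_labels)))
```

To use it in the walkthrough, pass `compute_metrics=compute_metrics` when constructing the `Trainer`; the returned dictionary of results should then include the accuracy alongside the loss.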

Similar to the text classification example above, you can use other tasks as well. The code is available at the following github repository

Democratizing Conversational AI using Transformers and Pre-Trained Large Scale LMs

The true democratization of Conversational AI would involve giving all application users access to models that have the capability to “Self Train” and “Self Manage” by discovering patterns from data automatically. Then it is no longer about data pipelines and ML toolkits; the AI models deal with those themselves. This makes Deep Learning and AI much more accessible. Got It AI, one of the leading Conversational AI R&D firms, is turning this vision of democratization into reality by leveraging Transformers and Pre-trained Large Scale LMs. It has built Conversational AI models and products that “Self Train” and “Self Manage”, letting users and customers simply monitor and validate them via “No Code AI”.

Editor’s note: Chandra is a speaker for ODSC East 2021. Check out his talk, “Advances in Conversational AI and NLP through Large Scale Language Models such as GPT-3,” there!

Author/ODSC East 2021 Speaker:

Chandra Khatri: Chief Scientist and Head of AI Research, Got It AI



Chandra Khatri is the Chief Scientist and Head of AI Research at Got It AI. He is also one of the leading experts in the field of Conversational AI. Prior to Got It AI, he led Conversational AI and Multimodal efforts at Uber AI. He was the founding Scientist of the Amazon Alexa Prize and has chaired or organized several AI conferences and workshops. He is best known for leveraging cutting-edge technologies and research to transform products impacting hundreds of millions of users.
