Using NLP to identify Adverse Drug Events (ADEs)

An adverse drug event (ADE) is defined as harm experienced by a patient as a result of exposure to a medication. A significant amount of information about drug-related safety issues such as adverse effects is published in medical case reports that usually can only be explored by human readers due to their unstructured nature.

In this tutorial, we will train a Natural Language Processing (NLP) model to identify ADEs in a given text. We will use an ADE dataset from the Hugging Face Dataset Hub to teach the model the difference between ADE-related and non-ADE-related texts. We will use Hugging Face and Amazon SageMaker to train and deploy the model and test it with phrases of our own.

Why is this important?

According to the Office of Disease Prevention and Health Promotion, ADEs have severe impacts: In inpatient settings, ADEs account for an estimated 1 in 3 of all hospital adverse events, affect about 2 million hospital stays each year, and prolong hospital stays by 1.7 to 4.6 days. In outpatient settings, ADEs account for over 3.5 million physician office visits, an estimated 1 million emergency department visits, and approximately 125,000 hospital admissions.

Being able to automate the flagging of ADE-related phrases in unstructured text data with an AI model can help prevent ADEs and result in safer and higher quality health care services, reduced health care costs, more informed and engaged consumers, and improved health outcomes.

The data

For this tutorial, we will use the ADE-Corpus-V2 Dataset. This dataset was published on Hugging Face in May 2021 and is publicly available. It consists of 3 subsets, (1) an ADE classification corpus, (2) a drug ADE relation corpus, and (3) a drug dosage relation corpus. For this tutorial, we will concentrate on the classification subset, which classifies sentences based on whether they are ADE-related (label=1) or not (label=0). This is what the dataset looks like:

Image by author
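To make the shape of the data concrete, here are a few illustrative records in the same format as the classification subset. Note that the sentences below are made up for illustration and are not quoted from the corpus:

```python
# Illustrative records in the shape of the ADE classification subset:
# each record has a "text" field and a binary "label"
# (1 = ADE-related, 0 = not ADE-related).
examples = [
    {"text": "The patient developed severe hepatotoxicity after starting methotrexate.",
     "label": 1},
    {"text": "The patient was prescribed 20 mg of atorvastatin daily.",
     "label": 0},
]

# A simple filter separates the ADE-related sentences.
ade_related = [ex for ex in examples if ex["label"] == 1]
print(len(ade_related))  # 1
```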

Plan of attack

There are several steps required to train an NLP model to identify ADE-related phrases: First we need to download and process the dataset which results in separate training, validation, and test datasets. We then train the model with the training and validation dataset. After that, we deploy it to an endpoint where we can test the model. The diagram below illustrates the order of steps:

Image by author

All of the code for this tutorial can be found in this GitHub repo. The repo is split into 4 notebooks that reflect the steps outlined above: 1_data_prep.ipynb (data processing), 2_inspect_data_optional.ipynb (a look at the processed data before training), 3_train.ipynb (model training), and 4_deploy.ipynb (model deployment and testing).

Processing the data

To process the data we will leverage Hugging Face Processing jobs on Amazon SageMaker. This allows us to spin up a compute instance that executes the processing script. After the job is done, the compute instance will automatically shut down and we will only be charged for the time it took to process the data.

Let’s have a look at some of the key parts of the processing script:

The processing script does a few things: it downloads the data, removes duplicates, shuffles and splits the data into training, validation, and test sets, tokenizes them, and saves the results to S3.

Shuffling and splitting the data

There are many ways to load a dataset, remove duplicates, and split it into several parts. In this script, we leverage the pandas and NumPy libraries to accomplish this, but Hugging Face's Dataset class also offers methods such as filter() and train_test_split().

One way to shuffle and split the dataset into three parts is to use NumPy’s split() function. It splits an array at the indices we pass in, so by passing the indices at 70% and 90% of the dataset’s length we get a 70/20/10 split. For example, to shuffle and then split one dataset into 70% training data, 20% validation data, and 10% test data we can use this line of code:

train, val, test = np.split(df.sample(frac=1), [int(0.7*len(df)), int(0.9*len(df))])
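To see the de-duplication and split in action, here is a toy run of that pattern on a small synthetic DataFrame (the data is a stand-in, not the ADE corpus; the split indices at 70% and 90% of the length yield the 70/20/10 proportions):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the ADE classification data (100 unique rows).
df = pd.DataFrame({
    "text": [f"sentence {i}" for i in range(100)],
    "label": np.random.randint(0, 2, size=100),
})

# drop_duplicates mirrors the de-duplication step; sample(frac=1) shuffles.
df = df.drop_duplicates(subset="text")
train, val, test = np.split(
    df.sample(frac=1, random_state=42),
    [int(0.7 * len(df)), int(0.9 * len(df))],
)

print(len(train), len(val), len(test))  # 70 20 10
```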

Tokenizing the data

To make the datasets ready for model training we will tokenize the datasets before saving them to S3:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(args.model_name)

def tokenize(batch):
    return tokenizer(batch['text'], padding='max_length', truncation=True)

train_ds = train_ds.map(tokenize, batched=True, batch_size=len(train_ds))
val_ds = val_ds.map(tokenize, batched=True, batch_size=len(val_ds))
test_ds = test_ds.map(tokenize, batched=True, batch_size=len(test_ds))
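Under the hood, padding='max_length' and truncation=True simply normalize every sequence to the tokenizer’s maximum length. A minimal sketch of that behavior on plain token-id lists (pad_id=0 and max_len=8 are illustrative values, not the model’s actual configuration):

```python
# Sketch of what padding/truncation does to token-id sequences:
# shorter sequences are padded with pad_id, longer ones are cut off.
def pad_or_truncate(ids, max_len=8, pad_id=0):
    if len(ids) >= max_len:
        return ids[:max_len]          # truncation
    return ids + [pad_id] * (max_len - len(ids))  # padding

short = pad_or_truncate([101, 7592, 102])
longer = pad_or_truncate(list(range(100, 112)))
print(short)   # [101, 7592, 102, 0, 0, 0, 0, 0]
print(longer)  # [100, 101, 102, 103, 104, 105, 106, 107]
```

Every sequence in a batch ends up with the same length, which is what lets the model consume them as a single tensor.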

Training the model

Training a Hugging Face model on SageMaker is straightforward, thanks to the partnership between Hugging Face and AWS. The training notebook in the Github repo follows the best practices on how to train an NLP model and uses SageMaker Training Jobs to spin up ephemeral compute instances to train the model. These compute instances are only up for training the model and are immediately torn down after the training has been completed. The model is saved on S3 and can be downloaded from there to host the model anywhere:

Image by author

Deploying the model

For this tutorial we will make our lives easier by grabbing the model artifact we just trained from S3 and deploying it to a SageMaker endpoint like so (see deployment notebook in the GitHub repo; the parameter values below are illustrative):

huggingface_model = HuggingFaceModel(
    model_data=model_s3_uri,        # S3 path to the trained model artifact (model.tar.gz)
    role=role,                      # IAM role with SageMaker permissions
    transformers_version="4.6",     # illustrative versions -- match the training job
    pytorch_version="1.7",
    py_version="py36",
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",   # illustrative instance type
)

This will create a SageMaker endpoint which we then can interact with via the SDK. This allows us to easily test and play around with the model.

Testing the model

Once the model is deployed we can test it with the test data (or some made-up sentences):

Image by author


Conclusion

In this tutorial, we trained an NLP model to identify adverse drug events (ADEs) in a given phrase. We used an annotated corpus designed to support the extraction of information about drug-related adverse effects from medical case reports to train the NLP model and deployed it to a SageMaker endpoint. The next steps could include an automated model evaluation using a dedicated evaluation script and publishing the model on Hugging Face’s Model Hub.


The dataset used for this tutorial is a product of the work described in this source article:

Gurulingappa et al., Benchmark Corpus to Support Information Extraction for Adverse Drug Effects, JBI, 2012.
http://www.sciencedirect.com/science/article/pii/S1532046412000615, DOI: https://doi.org/10.1016/j.jbi.2012.04.008

Article originally posted here by Heiko Hotz. Reposted with permission.

About the author: Heiko Hotz is a Senior Solutions Architect for AI & Machine Learning at AWS with over 20 years of experience in the technology sector. He focuses on Natural Language Processing (NLP) and helps AWS customers to be successful on their NLP journey.

ODSC Community

The Open Data Science community is passionate and diverse, and we always welcome contributions from data science professionals! All of the articles under this profile are from our community, with individual authors mentioned in the text itself.