An adverse drug event (ADE) is defined as harm experienced by a patient as a result of exposure to a medication. A significant amount of information about drug-related safety issues such as adverse effects is published in medical case reports that usually can only be explored by human readers due to their unstructured nature.
In this tutorial, we will train a Natural Language Processing (NLP) model to identify ADEs in a given text. Using an ADE dataset from the Hugging Face Dataset Hub, we will teach the model the difference between ADE-related and non-ADE-related texts. We will then use Hugging Face and Amazon SageMaker to train and deploy the model and test it with phrases of our own.
Why is this important?
According to the Office of Disease Prevention and Health Promotion, ADEs have severe impacts: In inpatient settings, ADEs account for an estimated 1 in 3 of all hospital adverse events, affect about 2 million hospital stays each year, and prolong hospital stays by 1.7 to 4.6 days. In outpatient settings, ADEs account for over 3.5 million physician office visits, an estimated 1 million emergency department visits, and approximately 125,000 hospital admissions.
Being able to automate the flagging of ADE-related phrases in unstructured text data with an AI model can help prevent ADEs and result in safer and higher quality health care services, reduced health care costs, more informed and engaged consumers, and improved health outcomes.
For this tutorial, we will use the ADE-Corpus-V2 dataset. This dataset was published on the Hugging Face Hub in May 2021 and is publicly available. It consists of three subsets: (1) an ADE classification corpus, (2) a drug-ADE relation corpus, and (3) a drug-dosage relation corpus. We will concentrate on the classification subset, which labels sentences as ADE-related (label=1) or not ADE-related (label=0). This is what the dataset looks like:
Image by author
Plan of attack
There are several steps required to train an NLP model to identify ADE-related phrases: First we need to download and process the dataset which results in separate training, validation, and test datasets. We then train the model with the training and validation dataset. After that, we deploy it to an endpoint where we can test the model. The diagram below illustrates the order of steps:
Image by author
All of the code for this tutorial can be found in this GitHub repo. The repo is split into four notebooks that reflect the steps outlined above: 1_data_prep.ipynb (data processing), 2_inspect_data_optional.ipynb (an optional look at the processed data before training), 3_train.ipynb (model training), and 4_deploy.ipynb (model deployment and testing).
Processing the data
To process the data we will leverage Hugging Face Processing jobs on Amazon SageMaker. This allows us to spin up a compute instance that executes the processing script. After the job is done, the compute instance will automatically shut down and we will only be charged for the time it took to process the data.
Let’s have a look at some of the key parts of the processing script:
The processing script does a few things: it downloads the data, removes duplicates, shuffles and splits the data, tokenizes it, and saves it to S3.
Shuffling and splitting the data
There are many ways to load a dataset, remove duplicates, and split it into several parts. In this script we leverage the Pandas and NumPy libraries to accomplish this, but Hugging Face's Dataset class also provides methods to filter and split datasets.
One way to shuffle the data and split it into three parts is NumPy's split() function, which splits a dataset at defined thresholds. For example, to shuffle and then split one dataset into 70% training data, 20% validation data, and 10% test data, we can use this line of code:
train, val, test = np.split(df.sample(frac=1), [int(0.7*len(df)), int(0.9*len(df))])
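Put together, the deduplication, shuffling, and splitting could look like the following self-contained sketch. It uses a small toy DataFrame in place of the real corpus; the column names text and label match the classification subset, but the data itself is made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the ADE classification subset (columns: text, label);
# the last row duplicates the first to demonstrate deduplication.
df = pd.DataFrame({
    "text": [f"sentence {i}" for i in range(10)] + ["sentence 0"],
    "label": [i % 2 for i in range(10)] + [0],
})

# Drop duplicate sentences, keeping the first occurrence
df = df.drop_duplicates(subset="text").reset_index(drop=True)

# Shuffle (sample the whole frame) and split 70% / 20% / 10%
train, val, test = np.split(
    df.sample(frac=1, random_state=42),
    [int(0.7 * len(df)), int(0.9 * len(df))],
)

print(len(train), len(val), len(test))  # → 7 2 1
```

Because np.split takes absolute row indices, the 0.7 and 0.9 thresholds mark where the 70/20/10 boundaries fall in the shuffled frame.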
Tokenizing the data
To make the datasets ready for model training we will tokenize the datasets before saving them to S3:
tokenizer = AutoTokenizer.from_pretrained(args.model_name)

def tokenize(batch):
    return tokenizer(batch['text'], padding='max_length', truncation=True)

train_ds = train_ds.map(tokenize, batched=True, batch_size=len(train_ds))
val_ds = val_ds.map(tokenize, batched=True, batch_size=len(val_ds))
test_ds = test_ds.map(tokenize, batched=True, batch_size=len(test_ds))
Training the model
Training a Hugging Face model on SageMaker is straightforward, thanks to the partnership between Hugging Face and AWS. The training notebook in the GitHub repo follows best practices for training an NLP model and uses SageMaker Training Jobs to spin up ephemeral compute instances. These instances are only up for the duration of the training and are torn down immediately after it completes. The model is saved to S3 and can be downloaded from there to host the model anywhere:
Image by author
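As a rough sketch, a training job like this is configured through the HuggingFace estimator from the SageMaker SDK. The entry point, instance type, model name, and hyperparameters below are illustrative assumptions, not the exact values from the training notebook, so treat this as a template rather than a drop-in script:

```python
from sagemaker.huggingface import HuggingFace

# Illustrative configuration -- entry_point, hyperparameters, and model_name
# are assumptions; see the training notebook in the repo for the actual values.
huggingface_estimator = HuggingFace(
    entry_point="train.py",          # training script submitted with the job
    instance_type="ml.p3.2xlarge",   # ephemeral GPU instance for the job
    instance_count=1,
    role=role,                       # SageMaker execution role
    transformers_version="4.6",
    pytorch_version="1.7",
    py_version="py36",
    hyperparameters={
        "epochs": 3,
        "train_batch_size": 32,
        "model_name": "distilbert-base-uncased",
    },
)

# Launch the job; the channels point at the processed datasets on S3
huggingface_estimator.fit({"train": train_input_path, "test": val_input_path})
```

When fit() returns, the trained model artifact (model.tar.gz) sits on S3, which is exactly what the deployment step below picks up.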
Deploying the model
For this tutorial we will make our lives easier by grabbing the model artifact we just trained from S3 and deploying it to a SageMaker endpoint like so (see the deployment notebook in the GitHub repo):
huggingface_model = HuggingFaceModel(
    model_data="s3://<YOUR_S3_PATH>/model.tar.gz",
    role=role,
    transformers_version="4.6",
    pytorch_version="1.7",
    py_version="py36",
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.p3.2xlarge"
)
This will create a SageMaker endpoint which we can then interact with via the SDK, allowing us to easily test and play around with the model.
Testing the model
Once the model is deployed we can test it with the test data (or some made-up sentences):
Image by author
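Under the hood, the deployed model is a binary sequence classifier, so a response takes a form like [{'label': 'LABEL_1', 'score': 0.98}]. The label names here are an assumption based on the default Hugging Face sequence-classification config (LABEL_0 / LABEL_1 mapping to the dataset's 0/1 labels); a small helper can turn such a response into a readable verdict:

```python
def interpret(response):
    """Map a raw endpoint response such as
    [{'label': 'LABEL_1', 'score': 0.98}] to a readable verdict.
    The LABEL_0/LABEL_1 names are an assumption based on the default
    Hugging Face sequence-classification config."""
    best = max(response, key=lambda r: r["score"])
    verdict = "ADE-related" if best["label"].endswith("1") else "not ADE-related"
    return f"{verdict} (score: {best['score']:.2f})"

# Against a live endpoint you would call (not run here):
# response = predictor.predict(
#     {"inputs": "The patient developed a rash after taking the drug."}
# )

# Simulated response for illustration:
print(interpret([{"label": "LABEL_1", "score": 0.98}]))  # → ADE-related (score: 0.98)
```

Keeping this mapping in one place also makes it easy to adjust if your trained model's config uses custom label names instead of the defaults.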
In this tutorial, we trained an NLP model to identify adverse drug events (ADEs) in a given phrase. We used an annotated corpus designed to support the extraction of information about drug-related adverse effects from medical case reports to train the NLP model and deployed it to a SageMaker Endpoint. The next steps could include an automated model evaluation using a dedicated evaluation script and publishing the model on Hugging Face’s Model Hub.
The dataset used for this tutorial is a product of the work described in this source article:
Gurulingappa et al., Benchmark Corpus to Support Information Extraction for Adverse Drug Effects, JBI, 2012.
http://www.sciencedirect.com/science/article/pii/S1532046412000615, DOI: https://doi.org/10.1016/j.jbi.2012.04.008
Article originally posted here by Heiko Hotz. Reposted with permission.
About the author: Heiko Hotz is a Senior Solutions Architect for AI & Machine Learning at AWS with over 20 years of experience in the technology sector. He focuses on Natural Language Processing (NLP) and helps AWS customers to be successful on their NLP journey.