Model Overload: Which NLP Model Should I Choose?

As I’m writing this, the model library on Huggingface consists of 11,256 models, and by the time you’re reading this, this number will only have increased. With so many models to choose from, it is no wonder that many people get overwhelmed and no longer know which model to choose for their NLP tasks.

It’d be great if there were a convenient way to try out different models for the same task and compare those models against each other on a variety of metrics. Sagemaker Experiments does exactly that: it lets you organize, track, compare, and evaluate NLP models very easily. In this article we will pit two NLP models against each other and compare their performance.

All the code is available in this Github repository.

Data Preparation

The data preparation for this article can be found in this Python script. We will use the IMDB dataset from Huggingface, which is a dataset for binary sentiment classification. The data preparation is pretty standard; the only thing to note is that we need to tokenize the data for each model separately, because each model comes with its own tokenizer. We will then store the data in S3 folders, one per model.
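As a rough illustration of the per-model preparation, here is a minimal sketch. It is not the script referred to above: the helper names `s3_prefix_for` and `tokenize_imdb`, the base prefix `imdb-data/`, and the `max_length` value are assumptions made for this example, and the upload to S3 is left out.

```python
def s3_prefix_for(model_name, base_prefix="imdb-data/"):
    """Return the S3 key prefix for the data tokenized for one model.

    base_prefix is a hypothetical value chosen for this sketch.
    """
    return f"{base_prefix}{model_name}"


def tokenize_imdb(model_name, max_length=512):
    """Load IMDB and tokenize it with the tokenizer that belongs to model_name."""
    # Imported lazily so the prefix helper works without these packages installed.
    from datasets import load_dataset
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    dataset = load_dataset("imdb")

    def tokenize(batch):
        # Pad and truncate so that every example has the same length.
        return tokenizer(batch["text"], padding="max_length",
                         truncation=True, max_length=max_length)

    return dataset.map(tokenize, batched=True)


model_list = ["distilbert-base-uncased", "distilroberta-base"]

# One S3 prefix per model, so the tokenized data sets never get mixed up.
prefixes = {name: s3_prefix_for(name) for name in model_list}
print(prefixes)
```

Calling `tokenize_imdb(name)` for each entry in `model_list` downloads the tokenizer and the dataset, so it needs network access; the prefix helper alone does not.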

The models we are comparing in this article will be distilbert-base-uncased and distilroberta-base. Obviously, Sagemaker Experiments is not limited to two models and allows you to track and compare several NLP models.

Metric definitions

First, it is important to understand how Sagemaker Experiments will collect the metrics which we will then use to compare the models. The values for these metrics are collected from the logs that are produced during model training. This usually means that the training script has to write out these metrics explicitly.

In our example, we will use Huggingface’s Trainer object which will take care of writing the metrics into the log for us. All we have to do is to define the metrics in the training script. The Trainer object will then automatically write them out into the training log (note that the loss metric is written out by default and that all metrics have the prefix “eval_”):

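from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
    # true labels of the evaluation set
    labels = pred.label_ids
    # predicted class = index of the highest score for each example
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="binary")
    acc = accuracy_score(labels, preds)
    # the Trainer adds the "eval_" prefix to these keys when it logs them
    return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall}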

Defining the evaluation metrics

Evaluation metrics in the training logs

That means we can capture these metrics during the training job via regular expressions, which we can define as follows:

metric_definitions = [
    {"Name": "test:loss", "Regex": "\'eval_loss\': (.*?),"},
    {"Name": "test:accuracy", "Regex": "\'eval_accuracy\': (.*?),"},
    {"Name": "test:f1", "Regex": "\'eval_f1\': (.*?),"},
    {"Name": "test:precision", "Regex": "\'eval_precision\': (.*?),"},
    {"Name": "test:recall", "Regex": "\'eval_recall\': (.*?),"},
]

We will pass those to the estimator we will create further down below to capture these metrics, which will allow us to compare the different NLP models.
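Before starting a training job it can be useful to check locally that the expressions capture what we expect. The sketch below applies them to a log line with Python's `re` module. The log line is made up for illustration (the values are placeholders, not results from this article), and `extract_metrics` is a hypothetical helper, not part of any SageMaker API.

```python
import re

metric_definitions = [
    {"Name": "test:loss", "Regex": "'eval_loss': (.*?),"},
    {"Name": "test:accuracy", "Regex": "'eval_accuracy': (.*?),"},
    {"Name": "test:f1", "Regex": "'eval_f1': (.*?),"},
    {"Name": "test:precision", "Regex": "'eval_precision': (.*?),"},
    {"Name": "test:recall", "Regex": "'eval_recall': (.*?),"},
]

# Made-up log line in the shape of a printed metrics dictionary;
# the numbers are placeholders only.
sample_log_line = (
    "{'eval_loss': 0.25, 'eval_accuracy': 0.9, 'eval_f1': 0.91, "
    "'eval_precision': 0.92, 'eval_recall': 0.9, 'epoch': 2.0}"
)


def extract_metrics(log_line, definitions):
    """Apply each regex to the log line and collect the captured values."""
    captured = {}
    for definition in definitions:
        match = re.search(definition["Regex"], log_line)
        if match:
            captured[definition["Name"]] = float(match.group(1))
    return captured


metrics = extract_metrics(sample_log_line, metric_definitions)
print(metrics)
```

Note that each expression expects a comma after the value. A metric that happens to be the last entry in the logged dictionary, and is therefore followed by a closing brace, would not be captured by these patterns.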

Running a Sagemaker Experiment

To organize and track the models we need to create a Sagemaker Experiment object:

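import time

import boto3
from smexperiments.experiment import Experiment

sm = boto3.client('sagemaker')

nlp_experiment = Experiment.create(
    # the experiment name is missing from the snippet; this value is a placeholder
    experiment_name=f"nlp-classification-{int(time.time())}",
    description="NLP Classification",
    sagemaker_boto_client=sm,
)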

Once that is done, we can kick off the training. We use ml.p3.2xlarge instances for the Sagemaker Training jobs, which complete the fine-tuning in about 30 minutes. Note that we create a Trial object for each training job. These trials get associated with the experiment we created above, which will allow us to track and compare the models:

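import time
from sagemaker.huggingface import HuggingFace
from smexperiments.trial import Trial

# model_list, role, sess, s3_prefix_orig and metric_definitions are expected to be defined earlier
# loop over models
for model_name in model_list:
    trial_name = f"nlp-trial-{model_name}-{int(time.time())}"
    # create a trial that will be attached to the experiment
    nlp_trial = Trial.create(trial_name=trial_name,
                             experiment_name=nlp_experiment.experiment_name,
                             sagemaker_boto_client=sm)

    hyperparameters = {'epochs': 2,
                       'train_batch_size': 32,
                       'model_name': model_name}

    huggingface_estimator = HuggingFace(entry_point='train.py',
                                        instance_type='ml.p3.2xlarge', instance_count=1,
                                        role=role,  # assumed: your SageMaker execution role
                                        # assumed versions; use a combination supported by your SDK
                                        transformers_version='4.6', pytorch_version='1.7', py_version='py36',
                                        hyperparameters=hyperparameters,
                                        metric_definitions=metric_definitions)
    nlp_training_job_name = f"nlp-training-job-{model_name}-{int(time.time())}"
    s3_prefix = s3_prefix_orig + model_name
    training_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/train'
    test_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/test'

    huggingface_estimator.fit(
        inputs={'train': training_input_path, 'test': test_input_path},
        job_name=nlp_training_job_name,
        experiment_config={
            "TrialName": nlp_trial.trial_name,
            "TrialComponentDisplayName": "Training",
        },
        wait=False)  # do not block, so that the jobs run in parallel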

The code above kicks off two training jobs (one for each model) in parallel. However, if that is not possible on your account (maybe the number of training job instances is restricted in your AWS account), you can also run these training jobs sequentially. As long as they get associated with the same experiment via the Trial object you will be able to evaluate and compare the models.

Comparing the models

After around 30 minutes, both models have been trained and it is time to retrieve the results:

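from sagemaker.analytics import ExperimentAnalytics
from sagemaker.session import Session

trial_component_analytics = ExperimentAnalytics(
    # Session expects a boto3 session and a SageMaker client
    sagemaker_session=Session(sess, sm),
    experiment_name=nlp_experiment.experiment_name,
)

df_results = trial_component_analytics.dataframe()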

The resulting dataframe holds all the information required to compare the two models. For example, we can retrieve the average values for all the metrics we defined.
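The snippet that selects the averages is not included in this text, so here is a minimal sketch of one way to do it. It assumes that the dataframe names its summary columns in the form "metric name - statistic" (for example "test:accuracy - Avg"). The helper `avg_metric_columns` and the sample column list are made up for illustration.

```python
def avg_metric_columns(columns, metric_prefix="test:"):
    """Return the column names that hold the average of each tracked metric."""
    return [
        col for col in columns
        if col.startswith(metric_prefix) and col.endswith(" - Avg")
    ]


# Hypothetical excerpt of df_results.columns, for illustration only.
sample_columns = [
    "TrialComponentName",
    "DisplayName",
    "model_name",
    "test:loss - Min",
    "test:loss - Avg",
    "test:accuracy - Avg",
    "test:f1 - Avg",
    "test:f1 - Last",
]

avg_columns = avg_metric_columns(sample_columns)
print(avg_columns)
```

With the real dataframe, something like `df_results[["model_name"] + avg_metric_columns(df_results.columns)]` should give one row per training job, assuming the `model_name` hyperparameter shows up as a column.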

We can see that distilroberta-base performed slightly better with respect to recall, while distilbert-base-uncased performed better with respect to F1 score, precision, and accuracy. There are many more columns in the dataframe which I will leave to the reader to explore further.


In this article we have created a Sagemaker Experiment to track and compare NLP models. We have created a Trial for each of the models and collected various evaluation metrics. Once the models had been fine-tuned, we were able to access these metrics via a Pandas dataframe and compare the models in a convenient way.

Article originally posted here by Heiko Hotz. Reposted with permission.

About the author: Heiko Hotz is a Senior Solutions Architect for AI & Machine Learning at AWS with over 20 years of experience in the technology sector. He focuses on Natural Language Processing (NLP) and helps AWS customers to be successful on their NLP journey.
