Setting up a Text Summarization Project

When OpenAI released the third generation of their machine learning (ML) model that specializes in text generation in July 2020, I knew something was different. This model struck a nerve like none that came before it. Suddenly I heard friends and colleagues, who might be interested in technology but usually don’t care much about the latest advancements in the AI/ML space, talk about it. Even the Guardian wrote an article about it. Or, to be precise, the model wrote the article and the Guardian edited and published it. There was no denying it: GPT-3 was a game-changer.

Once the model had been released, people immediately started to come up with potential applications for it. Within weeks plenty of impressive demos were created, which can be found on the Awesome GPT-3 website. One particular application that caught my eye was text summarization, i.e. the capability of a computer to read a given text and summarise its content. It combines two fields within the field of Natural Language Processing (NLP), reading comprehension and text generation, and is one of the hardest tasks for a computer. This is why I was so impressed by the GPT-3 demos for text summarization.

You can give them a try on the Hugging Face Spaces website. My favorite one at the moment is an application that generates summaries of news articles with just the URL of the article as input.

What is this tutorial about?

Many organizations I work with (charities, companies, NGOs) have huge amounts of texts they need to read and summarise — financial reports or news articles, scientific research papers, patent applications, legal contracts, etc. Naturally, these organizations are interested in automating these tasks with NLP technology. So, in order to demonstrate the art of the possible, I often use the text summarisation demos and they almost never fail to impress.

But now what?

The challenge for these organizations is that they want to assess text summarization models based on summaries for many, many documents — not one at a time. They don’t want to hire an intern whose only job is to open the application, paste in a document, hit the “Summarise” button, wait for the output, assess whether the summary is good, and do that all over again for thousands of documents.

This brings us to the objective of this blog post series: In this tutorial, I propose a practical guide for organizations so they can assess the quality of text summarization models for their domain.

Who is this tutorial (not) for?

I wrote this tutorial with my past self from four weeks ago in mind, i.e. it’s the tutorial I wish I had back then when I started on this journey. In that sense, the target audience of this tutorial is someone who is familiar with AI/ML and has used Transformer models before, but is at the beginning of their text summarisation journey and wants to dive deeper into it. Because it’s written by a “beginner” for beginners, I want to stress the fact that this tutorial is a practical guide, not THE practical guide. Please treat it as if George Box had said:

Image by author

In terms of how much technical knowledge is required in this tutorial: It does involve some coding in Python, but most of the time we will just use the code to call APIs, so no deep coding knowledge is required, either. It will be useful to be familiar with certain concepts of machine learning, e.g. what it means to train and deploy a model, the concepts of training, validation, and test datasets, and so on. Also having dabbled with the transformers library before might be useful, as we will use this library extensively throughout this tutorial. That all being said I will try to include useful links for further reading for these concepts, if I don’t forget it 😉

Because this tutorial is written by a beginner, I don’t expect NLP experts and advanced deep learning practitioners to get much out of this tutorial. At least not from a technical perspective; you might still enjoy the read, though, so please don’t leave just yet! But you will have to be patient with regards to my simplifications: I tried to live by the concept of making everything in this tutorial as simple as possible, but not simpler.

Structure of this tutorial

This series will stretch over five parts in which we will go through different stages of a text summarisation project. In the first part, we will start by introducing a metric for text summarization tasks, i.e. a measure of performance that will allow us to assess whether a summary is “good” or “bad”. We will also introduce the dataset we want to summarise and create a baseline using a no-ML “model”, i.e. we will use a simple heuristic to generate a summary from a given text. Creating this baseline is a vitally important step in any ML project because it will enable us to quantify how much progress we make by using AI going forward, i.e. it allows us to answer the question “Is it really worth investing in AI technology?”

In the next part (part 2) we will use a model that has already been pre-trained to generate summaries. This is possible because of a modern approach in ML called Transfer Learning. You can read more about it in this paper. This is another useful step because we basically take a model off-the-shelf and test it on our dataset. This allows us to create another baseline which will be useful to see what happens when we actually train the model on our dataset. This approach is called zero-shot summarization because the model has had zero exposure to our dataset.

After that, it is time to use a pre-trained model and train it on our own dataset (part 3). This is also called fine-tuning. It will enable the model to learn from the patterns and idiosyncrasies of our data and slowly adapt to it. Once we have trained the model we will use it to create summaries (part 4).

So, just to summarise (see what I did there?):

  • Part 1: Using a no-ML “model” to establish a baseline
  • Part 2: Generating summaries with a zero-shot model
  • Part 3: Training a summarization model
  • Part 4: Evaluating the trained model

What will we have achieved by the end of this tutorial?

Now is the time for a brutal reality check, I’m afraid: By the end of this tutorial, we will not have a text summarization model that can be used in production. We won’t even have a good summarisation model (insert scream emoji here)!

What we will have instead is a starting point for the next phase of the project, which is the experimentation phase. This is now where the science in data science comes in, because now it’s all about experimenting with different models and different settings to understand whether a good enough summarisation model can be trained with the available training data.

And, to be completely transparent, there is a good chance that the conclusion will be that the technology is just not ripe yet and that the project will not be implemented. And you have to prepare your business stakeholders for that possibility. But that’s a story for another blog post 😉

Part 1 — Creating a baseline

This is the first part of a tutorial on setting up a text summarisation project. For more context and an overview of this tutorial, please refer back to the introduction.

In this part, we will establish a baseline using a very simple “model”, without actually using machine learning (ML). This is a very important step in any ML project, as it allows us to understand how much value ML adds over the course of the project and whether it’s worth investing in it.

The code for the tutorial can be found in this Github repo.

Data, data, data …

Every ML project starts with data! If possible, we always should use data related to what we want to achieve with a text summarisation project. For example, if our goal is to summarise patent applications we should also use patent applications to train the model. A big caveat for an ML project is that the training data usually needs to be labeled. In the context of text summarization, that means we need to provide the text to be summarised as well as the summary (the “label”). Only by providing both can the model learn what a “good” summary looks like.

In this tutorial, we will use a publicly available dataset, but the steps and code would remain exactly the same if we used a custom/private dataset. And again, if you have an objective in mind for your text summarization model and have corresponding data, please use your data instead to get the most out of this.

The data we will use is the arXiv dataset which contains abstracts of arXiv papers as well as their titles. For our purpose we will use the abstract as the text we want to summarise and the title as the reference summary. All the steps of downloading and pre-processing the data can be found in this notebook. The dataset was developed as part of this paper and is licensed under the Creative Commons CC0 1.0 Universal Public Domain Dedication.

Note that the data is split into three datasets, training, validation, and test data. If you’d like to use your own data, make sure this is the case too. Just as a quick reminder, this is how we will use the different datasets:

Image by author
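If you bring your own data and it is not split yet, the three datasets can be created with a few lines of Python. The following is only a minimal sketch: the function name, the 80/10/10 proportions, the fixed seed, and the toy records are all assumptions made for illustration, not values taken from the arXiv dataset used here.

```python
import random


def train_val_test_split(records, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle records and split them into train, validation and test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split reproducible
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Hypothetical toy records: (text, summary) pairs
records = [(f"abstract {i}", f"title {i}") for i in range(100)]
train, val, test = train_val_test_split(records)
print(len(train), len(val), len(test))
```

The important property is that every record ends up in exactly one of the three sets, so the test data is never seen during training.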

Naturally, a common question at this point is: How much data do we need? And, as you can probably already guess, the answer is: It depends. It depends on how specialized the domain is (summarising patent applications is quite different from summarising news articles), how accurate the model needs to be in order to be useful, how much the training of the model should cost, etc. We will return to this question at a later point when we actually train the model, but the short of it is that we will have to try out different dataset sizes once we are in the experimentation phase of the project.

What makes a good model?

In many ML projects, it is rather straightforward to measure a model’s performance. That’s because there is usually little ambiguity around whether the model’s result is correct. The labels in the dataset are often binary (True/False, Yes/No) or categorical. In any case, it’s easy in this scenario to compare the model’s output to the label and mark it as correct or incorrect.

When generating text this becomes more challenging. The summaries (the labels) we provide in our dataset are only one way to summarise text. But there are many possibilities to summarise a given text. So, even if the model doesn’t match our label 1:1, the output might still be a valid and useful summary. So how do we compare the model’s summary with the one we provide? The metric that is used most often in text summarization to measure the quality of a model is the ROUGE score. To understand the mechanics of this metric I recommend this blog post. In summary, the ROUGE score measures the overlap of n-grams (contiguous sequence of n items) between the model’s summary (candidate summary) and the reference summary (the label we provide in our dataset). But, of course, this is not a perfect measure and to understand its limitations, I quite like this post.
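To make the idea of n-gram overlap a bit more tangible, here is a minimal sketch of a ROUGE-N style score in plain Python. It is a deliberate simplification (whitespace tokenisation, no stemming), the function names are made up for this illustration, and it is not the implementation used by the ROUGE package in the rest of this tutorial.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Simplified ROUGE-N: precision, recall and F1 of the n-gram overlap."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # matches, clipped to the smaller count
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge_n("the cat sat on the mat", "the cat lay on the mat", n=1)
print(scores)
```

In this toy example five of the six unigrams match, so precision, recall and F1 all come out at 5/6. With n=2 only three of the five bigrams match, which illustrates why ROUGE-2 scores are typically lower than ROUGE-1 scores.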

So, how do we calculate the ROUGE score? There are quite a few Python packages out there to compute this metric and to ensure consistency, we should use the same method throughout our project. Because we will, at a later point in this tutorial, be quite l̶a̶z̶y̶ smart and use a training script from the Transformers library instead of writing our own, we can just peek into the source code of the script and copy the code that computes the ROUGE score:

from datasets import load_metric
metric = load_metric("rouge")

def calc_rouge_scores(candidates, references):
    result = metric.compute(predictions=candidates, references=references, use_stemmer=True)
    result = {key: round(value.mid.fmeasure * 100, 1) for key, value in result.items()}
    return result

By using this method to compute the score we ensure that we always compare apples to apples throughout the project.

Note that this function will compute several ROUGE scores: rouge1, rouge2, rougeL, and rougeLsum (the “sum” in rougeLsum refers to the fact that this metric is computed over a whole summary, while rougeL is computed as the average over individual sentences). So, which ROUGE score should we use for our project? Again, we will have to try different approaches in the experimentation phase. For what it’s worth, the original ROUGE paper states that “ROUGE-2 and ROUGE-L worked well in single document summarization tasks” while “ROUGE-1 and ROUGE-L perform great in evaluating short summaries”.

Creating the baseline

Next up we want to create the baseline by using a simple, no-ML model. What does that mean? Well, in the field of text summarization, many studies use a very simple approach: They take the first n sentences of the text and declare it the candidate summary. They then compare the candidate summary with the reference summary and compute the ROUGE score. This is a simple yet powerful approach that we can implement in a few lines of code (the entire code for this part can be found in this notebook):

import re

ref_summaries = list(df_test['summary'])

# Try the first 1, 2, and 3 sentences of each text as the candidate summary
for i in range(3):
    candidate_summaries = list(df_test['text'].apply(lambda x: ' '.join(re.split(r'(?<=[.:;])\s', x)[:i+1])))
    print(f"First {i+1} sentences: Scores {calc_rouge_scores(candidate_summaries, ref_summaries)}")

Note that we use the test dataset for this evaluation. This makes sense because once we train the model we will also use the same test dataset for final evaluation. We also try different numbers for n, i.e. we start with only the first sentence as candidate summary, then the first two sentences, and finally the first three sentences.
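To see what the sentence-splitting expression does in isolation, here is a small stand-alone sketch that applies the same regular expression to a single made-up text. The helper name first_n_sentences and the sample text are assumptions for illustration only.

```python
import re


def first_n_sentences(text, n):
    """Return the first n 'sentences', splitting after '.', ':' or ';'."""
    sentences = re.split(r'(?<=[.:;])\s', text)
    return ' '.join(sentences[:n])


text = "We study summarization. Our method: a simple baseline; it works well. More details follow."
for n in range(1, 4):
    print(n, "->", first_n_sentences(text, n))
```

Note that the expression also splits after colons and semicolons, so a “sentence” in this heuristic can be shorter than a grammatical sentence.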

And these are the results for our first “model”:

Image by author

We can see that the scores are highest with only the first sentence as the candidate summary. This means that taking more than one sentence makes the summary too verbose and leads to a lower score. So we will use the scores for the one-sentence summaries as our baseline.

It’s important to note that, for such a simple approach, these numbers are actually quite good, especially for the rouge1 score. To put these numbers in context we can check this page, which shows the scores of a state-of-the-art model for different datasets.

Conclusion and what’s next

We have introduced the dataset which we will use throughout the summarisation project as well as a metric to evaluate summaries. We then created the following baseline with a simple, no-ML model:

Image by author

In the next part, we will be using a zero-shot model, i.e. a model that has been specifically trained for text summarization on public news articles. However, this model won’t be trained at all on our dataset (hence the name “zero-shot”).

I will leave it to you as homework to guess how this zero-shot model will perform compared to our very simple baseline. On the one hand, it will be a much more sophisticated model (it’s actually a neural network); on the other, it has only learned to summarise news articles, so it might struggle with the patterns that are inherent to the arXiv dataset.

Part 2 — Zero-shot learning

This is the second part of a tutorial on setting up a text summarisation project. For more context and an overview of this tutorial, please refer back to the introduction as well as part 1 in which we created a baseline for our project.

In this blog post, we will leverage the concept of zero-shot learning (ZSL) which means we will use a model that has been trained to summarise text but hasn’t seen any examples of the arXiv dataset. It’s a bit like trying to paint a portrait when all you have been doing in your life is landscape painting. You know how to paint, but you might not be too familiar with the intricacies of portrait painting.

The code for the entire tutorial can be found in this Github repo. For today’s part, we will use this notebook, in particular.

Why Zero-Shot Learning (ZSL)?

ZSL has become popular over the past years because it allows leveraging state-of-the-art NLP models with no training. And their performance is sometimes quite astonishing: The Big Science Research Workgroup has recently released their T0pp (pronounced “T Zero Plus Plus”) model, which has been trained specifically for researching zero-shot multitask learning. It can often outperform models 6x larger on the BIG-bench benchmark, and can outperform the 16x larger GPT-3 on several other NLP benchmarks.

Another benefit of ZSL is that it takes literally two lines of code to use it. By just trying it out we can create a second baseline, which we can use to quantify the gain in model performance once we fine-tune the model on our dataset.

Setting up a zero-shot learning pipeline

To leverage ZSL models we can use Hugging Face’s Pipeline API. This API enables us to use a text summarization model with just two lines of code while it takes care of the main processing steps in an NLP model:

  1. The text is preprocessed into a format the model can understand.
  2. The preprocessed inputs are passed to the model.
  3. The predictions of the model are post-processed, so you can make sense of them.

It leverages the summarisation models that are already available on the Hugging Face model hub.

So, here’s how to use it:

from transformers import pipeline

summarizer = pipeline("summarization")

That’s it, believe it or not. This code will download a summarization model and create summaries locally on your machine. In case you’re wondering which model it uses, you can either look it up in the source code or inspect the pipeline object itself, for example by printing summarizer.model.config.name_or_path (this attribute is an assumption based on recent versions of the transformers library).

Doing so, we see that the default model for text summarization is called sshleifer/distilbart-cnn-12-6:

Image by author

We can find the model card for this model on the Hugging Face website, where we can also see that the model has been trained on two datasets: The CNN Dailymail dataset and the Extreme Summarization (XSum) dataset. It is worth noting that this model is not familiar with the arXiv dataset and has only learned to summarise texts that are similar to the ones it has been trained on (mostly news articles). The numbers 12 and 6 in the model name refer to the number of encoder layers and decoder layers, respectively. Explaining what these are is outside the scope of this tutorial, but you can read more about it in this blog post by Sam Shleifer, who created the model.

We will use the default model going forward, but I encourage you to try out different pre-trained models. All the models that are suitable for summarization can be found here. To use a different model you can specify the model name when calling the Pipeline API:

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

We haven’t spoken yet about two possible but different approaches to text summarization: Extractive vs Abstractive. Extractive summarization is the strategy of concatenating extracts taken from a text into a summary, while abstractive summarisation involves paraphrasing the corpus using novel sentences. Most of the summarisation models are based on models that generate novel text (they are Natural Language Generation models, like, for example, GPT-3). This means that the summarisation models will also generate novel text, which makes them abstractive summarization models.
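For contrast with the abstractive models used in this tutorial, the following sketch shows what a very naive extractive summarizer could look like: it scores each sentence by the average frequency of its words and returns the top-scoring sentences verbatim. The scoring rule, the function name, and the sample text are assumptions chosen for illustration; this is not a method used elsewhere in this series.

```python
import re
from collections import Counter


def extractive_summary(text, n_sentences=1):
    """Pick the n highest-scoring sentences, scored by average word frequency."""
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    freq = Counter(re.findall(r'\w+', text.lower()))

    def score(sentence):
        tokens = re.findall(r'\w+', sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:n_sentences])  # keep the original sentence order
    return ' '.join(sentences[i] for i in chosen)


text = ("Transformers are neural networks. "
        "Transformers are used for summarization. "
        "The weather was nice yesterday.")
print(extractive_summary(text, n_sentences=1))
```

Every sentence in the output is copied unchanged from the input, which is exactly what distinguishes an extractive approach from an abstractive model that writes new sentences.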

Generating zero-shot summaries

Now that we know how to use it, we want to use it on our test dataset, exactly the same dataset we used in part 1 to create the baseline. We can do that with this loop:

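# texts to summarise: the same test set we used in part 1
texts = list(df_test['text'])
candidate_summaries = []

for i, text in enumerate(texts):
    if i % 100 == 0:
        print(i)  # simple progress indicator
    candidate = summarizer(text, min_length=5, max_length=20)
    # the pipeline returns a list with one dict per input text
    candidate_summaries.append(candidate[0]['summary_text'])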

Note that we have the min_length and max_length parameters to control the summary the model generates. Both are measured in tokens, which are roughly comparable to words. In this example, we set min_length to 5 because we want the title to be at least about 5 words long. And by eye-balling the reference summaries (i.e. the actual titles for the research papers) it looks like 20 could be a reasonable value for max_length. But again, this is just a first attempt and once the project is in the experimentation phase, these two parameters can and should be changed to see if the model performance changes.

If you’re already familiar with text generation you might know there are many more parameters to influence the text a model generates, such as beam search, sampling, and temperature. These parameters give you more control over the text that is being generated, for example to make the text more fluent, less repetitive, etc. These techniques are not available in the Pipeline API: you can see in the source code that min_length and max_length are the only parameters that will be considered. Once we train and deploy our own model, however, we will have access to those parameters. More on that in part 4 of this series.
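As a quick illustration of one of these parameters, the sketch below shows the usual way temperature is applied: the model’s raw scores (logits) are divided by the temperature before they are turned into probabilities. The logits here are made-up numbers for three imaginary candidate tokens, and the function name is an assumption for this example.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

With a low temperature most of the probability mass moves to the highest-scoring token, so sampled text becomes more predictable. A high temperature flattens the distribution and makes the output more varied.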

Model evaluation

Once we have generated the zero-shot summaries, we can use our ROUGE function again to compare the candidate summaries with the reference summaries:

from datasets import load_metric
metric = load_metric("rouge")

def calc_rouge_scores(candidates, references):
    result = metric.compute(predictions=candidates, references=references, use_stemmer=True)
    result = {key: round(value.mid.fmeasure * 100, 1) for key, value in result.items()}
    return result

Running this calculation on the summaries that were generated with the ZSL model, we get the following results:

Image by author

When we compare those with our baseline from part 1, we see that this ZSL model is actually performing worse than our simple heuristic of just taking the first sentence. Again, this is not unexpected: While this model knows how to summarise news articles, it has never seen an example of summarising the abstract of an academic research paper.


We now have created two baselines, one using a simple heuristic and one with a ZSL model. By comparing the ROUGE scores we see that the simple heuristic currently outperforms the deep learning model:

Image by author

In the next part we will take this very same deep learning model and try to improve its performance. We will do so by training it on the arXiv dataset (this step is also called fine-tuning): We leverage the fact that it already knows how to summarise text in general. We then show it lots of examples of our arXiv dataset. Deep learning models are exceptionally good at identifying patterns in datasets once they get trained on them, so we do expect the model to get better at this particular task.

Part 3 — Training a Summarisation Model

In this part we will train the model we used for zero-shot summaries in part 2 (sshleifer/distilbart-cnn-12-6) on our dataset. The idea is to teach the model what summaries for abstracts of research papers look like by showing it many examples. Over time the model should recognize the patterns in this dataset which will allow it to create better summaries.

It is worth noting once more that if you have labeled data, i.e. texts and corresponding summaries, you should use those to train a model. Only by doing so can the model learn the patterns of your specific dataset.

SageMaker training jobs

Because training a deep learning model would take a few weeks on my laptop, we will leverage SageMaker training jobs instead. You can learn all about training jobs in this documentation, but I want to briefly highlight the advantage of using these training jobs, besides the fact that they allow us to use GPU compute instances.

So, let’s assume we have a cluster of GPU instances we could use. In that case we would likely want to create a Docker image to run the training so that we can easily replicate the training environment on other machines. We would then install the required packages and because we want to use several instances we need to set up distributed training as well. Once the training is done we want to quickly shut down these computers because they are costly.

All these steps are abstracted away from us when using training jobs. In fact, we can train a model in the same way as described above by specifying the training parameters and then just calling one method. SageMaker will take care of the rest, including terminating the GPU instances once the training is completed so as not to incur any further costs.

In addition, Hugging Face and AWS announced a partnership earlier this year that makes it even easier to train Hugging Face models on SageMaker. We can find many examples of how to do so in this Github repo.

Setting up the training job

In fact, we will use one of those examples as a template because it does almost everything we need for our purpose: Training a summarisation model on a specific dataset in a distributed manner (i.e. using more than one GPU instance).

One thing, however, we have to account for is that this example uses a dataset directly from the Hugging Face dataset hub. Because we want to provide our own custom data we need to amend the notebook slightly.

To account for the fact that we bring our own dataset we need to leverage channels. You can find more about them in this documentation.

Now, I personally find this term a bit confusing, so in my mind I always think mapping when I hear channels, because it helps me better visualize what happens. Let me try to explain: As we have already learned, the training job spins up a cluster of EC2 instances and copies a Docker image onto it. However, our datasets live in S3 and cannot be accessed by that Docker image. Instead, the training job needs to copy the data from S3 into a pre-defined path “locally” into that Docker image. The way it does that is by us telling the training job where the data sits in S3 and where on the Docker image the data should be copied to so that the training job can access it. We map the S3 location to the local path.

We set the local path in the hyperparameters section of the training job:

Image by author

And then we tell the training job where the data resides in S3 when calling the fit() method which will start the training:

Image by author

Note that the folder name after /opt/ml/input/data matches the channel name (datasets). This enables the training job to copy the data from S3 to the local path.
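The mapping between channel names and local paths can be sketched in plain Python without calling SageMaker at all. In the snippet below, the S3 location and the file names train.csv and val.csv are hypothetical placeholders, and the helper function is made up for this illustration; only the base directory /opt/ml/input/data and the channel name datasets come from the setup described above.

```python
# Base directory under which SageMaker makes each input channel available
SM_INPUT_DIR = "/opt/ml/input/data"


def channel_local_path(channel_name):
    """Local directory inside the training container for a given channel."""
    return f"{SM_INPUT_DIR}/{channel_name}"


# Hypothetical S3 location; replace with your own bucket and prefix
channels = {"datasets": "s3://my-bucket/summarization/data"}

# Hypothetical file names; the local paths must start with the channel directory
hyperparameters = {
    "train_file": f"{channel_local_path('datasets')}/train.csv",
    "validation_file": f"{channel_local_path('datasets')}/val.csv",
}

for name, s3_uri in channels.items():
    print(f"{s3_uri} -> {channel_local_path(name)}")
```

A dictionary like channels is what would be handed to fit(), while the paths in hyperparameters tell the training script where to find the copied files.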

Starting the training

Once we have done that we can start the training job. As mentioned before, this is done by calling the fit() method. The training job will run for about 40 minutes and you can follow the progress and see additional information in the console:

Image by author

The complete code for the model training is in this notebook. Once the training job has finished it’s time to evaluate our newly trained model.

Part 4 — Model Evaluation

Evaluating our trained model is very similar to what we have done in part 2 where we evaluated the ZSL model: We will call the model and generate candidate summaries and compare them to the reference summaries by calculating the ROUGE scores. But right now the model sits in S3 in a file called model.tar.gz (to find the exact location you can check the training job in the console). So how do we access the model to generate summaries?

Well, we have two options: We can either deploy the model to a SageMaker endpoint or download it locally similar to what happened in part 2 with the ZSL model. In this tutorial I choose to deploy the model to a SageMaker endpoint because it is more convenient and by choosing a more powerful instance for the endpoint we can shorten the inference time significantly. That being said, in the Github repo you will also find a notebook that shows how to evaluate the model locally.

Deploying a model

It’s usually very easy to deploy a trained model on SageMaker, see again this example from Hugging Face. Once the model has been trained, we can just call estimator.deploy() and SageMaker does the rest for us in the background. Because in our tutorial we switch from one notebook to the next, we have to locate the training job and the associated model first, before we can deploy it:

Image by author

Once we have retrieved the model location, we can deploy it to a SageMaker endpoint:

from sagemaker.huggingface import HuggingFaceModel

model_for_deployment = HuggingFaceModel(entry_point='inference.py',

predictor = model_for_deployment.deploy(initial_instance_count=1,

Deployment on SageMaker is straightforward because it leverages the SageMaker Hugging Face Inference Toolkit, an open-source library for serving Transformers models on Amazon SageMaker. We normally don’t even have to provide an inference script because the toolkit takes care of that. In that case, however, the toolkit utilizes the Pipeline API again, and as we have discussed in part 2, the Pipeline API doesn’t allow us to use advanced text generation techniques such as beam-search and sampling. To avoid this limitation we provide our own custom inference script.

First evaluation

For the first evaluation of our newly trained model we will use the same parameters as in part 2 with the zero-shot model to generate the candidate summaries. This allows us to make an apples-to-apples comparison:

candidate_summaries = []

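for i, text in enumerate(texts):
    data = {"inputs":text, "parameters_list":[{"min_length": 5, "max_length": 20}]}
    candidate = predictor.predict(data)
    # NOTE: the response format depends on the custom inference.py; extract the
    # summary string from `candidate` and append it to candidate_summaries here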

We then compare the summaries generated by the model with the reference summaries, using the same calc_rouge_scores function as before.

This is encouraging! Our first attempt to train the model, without any hyperparameter tuning, has improved the ROUGE scores significantly:

Image by author

Second evaluation

Now it’s finally time to use some more advanced techniques such as beam-search and sampling to play around with the model. You can find a detailed explanation of what each of these parameters does in this excellent blog post. So let’s try it with a semi-random set of values for some of these parameters:

candidate_summaries = []

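for i, text in enumerate(texts):
    data = {"inputs":text,
            "parameters_list":[{"min_length": 5, "max_length": 20, "num_beams": 50, "top_p": 0.9, "do_sample": True}]}
    candidate = predictor.predict(data)
    # NOTE: the response format depends on the custom inference.py; extract the
    # summary string from `candidate` and append it to candidate_summaries here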

When running our model with these parameters, we get the following scores:

Image by author

So that didn’t work out quite as we hoped: the ROUGE scores have actually gone down slightly. However, don’t let this discourage you from trying out different values for these parameters. In fact, this is the point where we finish with the setup phase and transition into the experimentation phase of the project.
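When experimenting with these values it helps to have a rough mental model of what they do. As one example, the sketch below illustrates the idea behind top_p (also known as nucleus sampling): before sampling, only the most likely tokens whose cumulative probability reaches top_p are kept. The token probabilities and the function name are invented for this illustration, and real libraries apply the same idea to the model’s full vocabulary.

```python
def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = {}, 0.0
    for token, p in ranked:
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}  # renormalise to sum to 1


# Hypothetical next-token distribution
probs = {"model": 0.5, "network": 0.3, "method": 0.15, "banana": 0.05}
print(top_p_filter(probs, top_p=0.9))
```

With top_p set to 0.9 the unlikely token is removed before sampling, while a lower value such as 0.5 would leave only the single most likely token in this example.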

Final Conclusion & next steps

We have concluded the setup for the experimentation phase. We have downloaded and prepared our data, created a first baseline with a simple heuristic, created another baseline using zero-shot learning, and then trained our own model and saw a significant increase in performance. Now it’s time to play around with every part we created in order to create even better summaries. A few ideas you might want to try:

  • Pre-processing the data properly, e.g. removing stop-words, punctuation, etc. Don’t underestimate this part: in many data science projects data-preprocessing is one of the (if not THE) most important aspects and data scientists typically spend most of their time with this task.
  • Trying out different models. In our tutorial we used the standard model for summarisation (sshleifer/distilbart-cnn-12-6) but as we know there are many more models out there that can be used for this task. One of those might better fit your use case.
  • Hyperparameter-tuning. When training the model we used a certain set of hyperparameters (learning rate, number of epochs, etc). These parameters are not set in stone, quite the opposite. You want to change these parameters to understand how they affect your model performance.
  • Different parameters for text-generation. We already did one round of creating summaries with different parameters to utilise beam-search and sampling. Try out different values and also different parameters. Refer back to this blog and other sources to understand how they affect the generation of text.

Article originally posted here by Heiko Hotz. Reposted with permission.

About the author: Heiko Hotz is a Senior Solutions Architect for AI & Machine Learning at AWS with over 20 years of experience in the technology sector. He focuses on Natural Language Processing (NLP) and helps AWS customers to be successful on their NLP journey.
