Teaser: Training a model to summarize Github Issues
The above results are randomly selected elements of a holdout set. Keep reading; there will be a link to many more examples below!
I never imagined I would ever use the word “magical” to describe the output of a machine learning technique. This changed when I was introduced to deep learning, where you can accomplish things like identifying objects in pictures or sorting two tons of legos. What is more amazing is that you do not need a PhD or years of training to unleash the power of these techniques on your data. You just need to be comfortable with writing code, high school level math, and patience.
However, there is a dearth of reproducible examples of how deep learning techniques are being used in industry. Today, I’m going to share with you a reproducible, minimum viable product that illustrates how to utilize deep learning to create data products from text (Github Issues).
This tutorial will focus on using sequence to sequence models to summarize text found in Github issues, and will demonstrate the following:
- You don’t need to have tons of computing power to achieve sensible results (I am going to use a single GPU).
- You don’t need to write lots of code. It’s surprising how so few lines of code can produce something so magical.
- Even if you do not want to summarize text, training a model to accomplish this task is useful for generating features for other tasks.
What I’m going to cover in this post:
- How to gather the data and prepare it for deep learning.
- How to construct the architecture of a seq2seq model and train the model.
- How to prepare the model for inference, and a discussion and demonstration of various use cases.
My goal is to focus on providing you with an end-to-end example so that you can develop a conceptual model of the workflow, rather than diving very deep into the math. I will provide links along the way to allow you to dig deeper if you so desire.
Get The Data
If you are not familiar with Github Issues, I highly encourage you to go look at a few before diving in. Specifically, the pieces of data we will be using for this exercise are Github Issue bodies and titles. An example is below:
We will gather many (Issue Title, Issue Body) pairs with the goal of training our model to summarize issues. The idea is that by seeing many examples of issue descriptions and titles a model can learn how to summarize new issues.
The best way to acquire Github data if you do not work at Github is to utilize this wonderful open source project, which is described as:
…. a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis.
Instructions on querying data from this project are available in the appendix of this article. An astute reader of this blog (David Shinn) has gone through all of the steps outlined in the appendix and has hosted the data required for this exercise as a csv file on Kaggle!
You can download the data from this page by clicking on the download link.
Prepare & Clean The Data
Keras Text Pre-Processing Primer
Now that we have gathered the data, we need to prepare the data for the modeling. Before jumping into the code, let’s warm up with a toy example of two documents:
[“The quick brown fox jumped over the lazy dog 42 times.”, “The dog is lazy”]
Below is a rough outline of the steps I will take in order to pre-process this raw text:
1. Clean text: in this step, we want to remove or replace specific characters and lower-case all the text. This step is discretionary and depends on the size of the data and the specifics of your domain. In this toy example, I lower-case all characters and replace numbers with *number* in the text. In the real data, I handle more scenarios.
[“the quick brown fox jumped over the lazy dog *number* times”, “the dog is lazy”]
2. Tokenize: split each document into a list of words.
[[‘the’, ‘quick’, ‘brown’, ‘fox’, ‘jumped’, ‘over’, ‘the’, ‘lazy’, ‘dog’, ‘*number*’, ‘times’], [‘the’, ‘dog’, ‘is’, ‘lazy’]]
3. Build vocabulary: You will need to represent each distinct word in your corpus as an integer, which means you will need to build a token -> integer map. Furthermore, I find it useful to reserve an integer for rare words that occur below a certain threshold, as well as 0 for padding (see next step). After you apply a token -> integer mapping, your data might look like this:
[[2, 3, 4, 5, 6, 7, 2, 8, 9, 10, 11], [2, 9, 12, 8]]
4. Padding: Your documents will have different lengths. There are many strategies for dealing with this in deep learning; however, for this tutorial I will pad and truncate documents such that they are all transformed to the same length, for simplicity. You can decide to pad (with zeros) and truncate your documents at the beginning or end, which I will refer to as “pre” and “post” respectively. After pre-padding our toy example, the data might look like this:
[[2, 3, 4, 5, 6, 7, 2, 8, 9, 10, 11], [0, 0, 0, 0, 0, 0, 0, 2, 9, 12, 8]]
A reasonable way to decide your target document length is to build a histogram of document lengths and choose a sensible number. (Note that the above example has padded the data in front but we could also pad at the end. We will discuss this more in the next section).
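The four steps above can be sketched in plain Python on the toy corpus. This is only an illustrative sketch (the function and variable names are mine, not from the tutorial's code); the real pipeline uses library utilities, but the transformations are the same:

```python
# A minimal sketch of the four pre-processing steps on the toy corpus.
import re

docs = ["The quick brown fox jumped over the lazy dog 42 times.",
        "The dog is lazy"]

# 1. Clean: lower-case and replace numbers with a *number* placeholder.
cleaned = [re.sub(r"\d+", "*number*", d.lower()).rstrip(".") for d in docs]

# 2. Tokenize: split each document into a list of words.
tokenized = [d.split() for d in cleaned]

# 3. Build vocabulary: map each token to an integer, reserving
#    0 for padding and 1 for rare/unknown words.
vocab = {}
for doc in tokenized:
    for tok in doc:
        if tok not in vocab:
            vocab[tok] = len(vocab) + 2  # real token indices start at 2
indexed = [[vocab.get(tok, 1) for tok in doc] for doc in tokenized]

# 4. Pad: pre-pad with zeros (and truncate) to a fixed length.
def pad_pre(seq, maxlen):
    return [0] * (maxlen - len(seq)) + seq[-maxlen:]

maxlen = max(len(d) for d in indexed)
padded = [pad_pre(d, maxlen) for d in indexed]
print(padded)
```

Running this reproduces the integer sequences shown in steps 3 and 4 above.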
Preparing Github Issues Data
For this section, you will want to follow along in this notebook. The data we are working with looks like this:
We can see there are issue titles and bodies, which we will process separately. I will not be using the URLs for modeling but only as a reference. Note that I have sampled 2M issues from the original 5M in order to make this tutorial tractable for others.
Personally, I find pre-processing text data for deep learning to be extremely repetitive. Keras has good utilities that allow you to do this, however I wanted to parallelize these tasks to increase speed.
The ktext package
I have built a utility called ktext that helps accomplish the pre-processing steps outlined in the previous section. This library is a thin wrapper around Keras and spaCy text processing utilities, and leverages Python process-based parallelism to speed things up. It also chains all of the pre-processing steps together and provides a bunch of convenience functions. Warning: this package is under development, so use it with caution outside this tutorial (pull requests are welcome!). To learn more about how this library works, look at this tutorial (but for now I suggest reading ahead).
To process the body data, we will execute this code:
See the full code in this notebook.
The above code cleans, tokenizes, and applies pre-padding and post-truncating such that each document is 70 words long. I made decisions about padding length by studying histograms of document length provided by ktext. Furthermore, only the top 8,000 words in the vocabulary are retained, and the remaining words are set to index 1, which corresponds to rare words (this was an arbitrary choice). It takes one hour for this to run on an AWS p3.2xlarge instance that has 8 cores and 60GB of memory. Below is an example of raw data vs. processed data:
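To make the body transform concrete, here is an illustrative sketch of what it does to a single tokenized document (the `transform` function and toy vocabulary are mine for illustration; ktext additionally handles cleaning, vocabulary building, and parallelism):

```python
# Illustrative sketch of the body transform on one tokenized document:
# out-of-vocabulary words -> rare-word index 1, then post-truncate and
# pre-pad with zeros to a fixed length (70 in the tutorial; 6 here).
def transform(tokens, vocab, maxlen=70):
    ids = [vocab.get(tok, 1) for tok in tokens]  # 1 = rare word
    ids = ids[:maxlen]                           # post-truncate long docs
    return [0] * (maxlen - len(ids)) + ids       # pre-pad short docs

vocab = {"fix": 2, "the": 3, "build": 4}  # toy vocabulary
vec = transform(["fix", "the", "flaky", "build"], vocab, maxlen=6)
print(vec)  # 'flaky' is out of vocabulary, so it maps to 1
```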
The titles will be processed almost the same way, but with some subtle differences:
See the full code in this notebook.
This time, we are passing some additional parameters:
- append_indicators=True will append the tokens ‘_start_’ and ‘_end_’ to the start and end of each document, respectively.
- padding=’post’ means that zero padding will be added to the end of the document instead of default of ‘pre’.
The reason for processing the titles in this way is that we want our model to know which token marks the beginning of a title, and also to learn to predict when the end of a phrase should occur. This will make more sense in the next section, where model architecture is discussed.
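The effect of those two parameters can be sketched as follows (again, the function and vocabulary here are illustrative stand-ins, not ktext internals):

```python
# Sketch of the title transform: append_indicators=True wraps each
# document in '_start_' / '_end_' tokens, and padding='post' puts the
# zeros at the end of the document instead of the beginning.
def transform_title(tokens, vocab, maxlen=6):
    tokens = ["_start_"] + tokens + ["_end_"]
    ids = [vocab.get(tok, 1) for tok in tokens][:maxlen]
    return ids + [0] * (maxlen - len(ids))  # post-pad with zeros

vocab = {"_start_": 2, "_end_": 3, "fix": 4, "build": 5}
print(transform_title(["fix", "build"], vocab))
```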
Define The Model Architecture
Building a neural network architecture is like stacking lego bricks. For beginners, it can be useful to think of each layer as an API: you send the API some data and then the API returns some data. Thinking of things this way frees you from becoming overwhelmed, and you can build your understanding of things slowly. It is important to understand two concepts:
- the shape of data that each layer expects, and the shape of data the layer will return. (When you stack many layers on top of each other, the input and output shapes must be compatible, like legos).
- conceptually, what will the output(s) of a layer represent? What does the output of a subset of stacked layers represent?
The above two concepts are essential to understanding this tutorial. If you don’t feel comfortable with this as you are reading below, I highly recommend watching as many lessons as you need from this MOOC and returning here.
In this tutorial, we will leverage an architecture called a sequence-to-sequence network. Pause reading this blog and carefully read A ten-minute introduction to sequence-to-sequence learning in Keras by Francois Chollet.
Once you finish reading that article, you should conceptually understand the below diagram, which illustrates a network that will take two inputs and have one output:
The network we will use for this problem will look very similar to the one in the tutorial described above, and is defined with this code:
For more context, see this notebook.
When you read the above code, you will notice references to the concept of teacher forcing. Teacher forcing is an extremely important mechanism that allows this network to train faster. This is explained better in this post.
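Concretely, teacher forcing means that during training the decoder is fed the ground-truth previous token rather than its own prediction. With titles that are wrapped in ‘_start_’/‘_end_’ tokens and post-padded (as prepared earlier), that amounts to a shift by one position (the variable names below are illustrative):

```python
# Teacher forcing, concretely: at each timestep the decoder's input is
# the ground-truth previous token, and its target is the next token.
title = [2, 4, 5, 3, 0, 0]   # [_start_, fix, build, _end_, pad, pad]

decoder_input  = title[:-1]  # what the decoder is fed during training
decoder_target = title[1:]   # what it learns to predict at each step
print(decoder_input, decoder_target)
```

At inference time the ground truth is unavailable, so the decoder is fed its own previous prediction instead, which is why training with teacher forcing converges faster.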
You may be wondering where I came up with the above architecture. I started with publicly available examples and performed lots of experiments. This xkcd comic really describes it best. You will notice that my loss function is sparse categorical crossentropy instead of categorical crossentropy, because this allows me to pass integers as my targets to predict instead of one-hot-encoding my targets, which is more memory efficient.
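The two losses compute the same number; the sparse variant simply takes the integer class id directly instead of a one-hot vector, so you never have to materialize an 8,000-wide one-hot array for every target token. A tiny worked example:

```python
import math

# Sparse and non-sparse categorical crossentropy give the same value;
# 'sparse' just skips the one-hot encoding step.
probs = [0.1, 0.7, 0.2]    # model's predicted distribution over 3 classes
target = 1                 # integer label (sparse form)
one_hot = [0.0, 1.0, 0.0]  # same label, one-hot form

sparse_cce = -math.log(probs[target])
cce = -sum(t * math.log(p) for t, p in zip(one_hot, probs))
assert abs(sparse_cce - cce) < 1e-12
print(round(sparse_cce, 4))  # -> 0.3567
```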