It’s always good to start a blog post with a joke (even if it’s not a very good one):
Why is this funny? In English, “well” can refer both to a state of being and to a device for retrieving water. The device in the second panel says “I’m well”, which is both an answer to the question “How are you?” and a statement that the device is, in fact, a well. It’s all about context!
If a Natural Language Processing (NLP) system does not have that context, we’d expect it not to get the joke. However, modern NLP systems like ChatGPT have a pretty good sense of humor:
That’s because modern NLP systems create contextualized representations of input text. In my previous blog post, I talked through three approaches to sentiment analysis (i.e. identifying the “emotional tone” of a particular document). These approaches were all based on a technique called “bagging”: splitting documents into an unordered collection of words (which we’ll refer to as “tokens”).
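Conceptually, each bagged document is just a token-count table. A minimal sketch using Python’s `collections.Counter` (the toy sentences are my own):

```python
from collections import Counter

# A minimal bag of words: split each document into tokens and count them.
# Note that "well" gets a single, context-free entry in both sentences.
docs = ["I am doing well", "tossing a coin down the well"]
bags = [Counter(doc.lower().split()) for doc in docs]

print(bags[0]["well"], bags[1]["well"])  # -> 1 1
```

The counts are identical even though the two “well”s mean completely different things, which is exactly the limitation discussed next.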
None of the approaches I covered (as implemented) capture the kind of context we’d need to understand the “well” joke. Each has a single representation for the word “well”, which combines the information for “doing well” with “wishing well”.
In this post, I’ll be demonstrating two deep learning approaches to sentiment analysis. Deep learning refers to the use of neural network architectures, characterized by their multi-layer design (i.e. “deep” architecture). I’ll be making use of the powerful spaCy library, which makes swapping architectures in NLP pipelines a breeze. This is a preview of my upcoming tutorial in May at ODSC East.
Follow along in the notebook!
Introduction to spaCy
spaCy is a Python library designed to provide a “complete” NLP pipeline, including ingestion, tokenization, tagging, representation, and even classification. I think this diagram gives a good overview:
Above you can see that text is processed by a “Language” object, which has a number of components such as part-of-speech tagging, vector representations, and models for categorization. These can be customized and trained. Raw text is fed into the Language object, which produces a Doc object. Docs are composed of Spans, which are made up of individual Tokens. Docs, Spans, and Tokens all have attributes such as “vector” which correspond to the customized components. We’ll mainly be using the “.cats” attribute of Docs, for which we’ll train a text categorization model to classify sentiment as “positive” or “negative.”
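To make those objects concrete, here’s a minimal sketch (assuming spaCy v3 is installed). A blank pipeline has no trained components, so it tokenizes text into a Doc of Tokens but leaves `doc.cats` empty until we train a “textcat” component:

```python
import spacy

# A blank English pipeline: tokenizer only, no trained components yet.
nlp = spacy.blank("en")
doc = nlp("How are you? I'm well.")

# The Doc is made up of individual Tokens.
print([token.text for token in doc])

# Before a "textcat" component is trained, doc.cats is just an empty dict.
print(doc.cats)  # -> {}
```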
Behind each spaCy Language object is a configuration file. This specifies the components and relevant assets, as well as the parameters used to train the models behind those components. For a more comprehensive view of the different components, see the documentation. We’ll focus on the sections that are most relevant here. You can see the full configuration in the GitHub repository:
- nlp: This defines the pipeline for the Language object (see above). We’ll be specifying a “textcat” component here, which will be applied to the Doc objects the pipeline produces.
- components: This section details the components we specified in the nlp section. In the first example, we’ll be defining an architecture based on a Convolutional Neural Network (CNN).
We’ll be using the same dataset as last time: a collection of 50k reviews from IMDB, each labeled as either positive or negative. This is a reasonably clean dataset and a fairly straightforward binary objective; the real world is usually neither so clean nor so straightforward. I do some manipulations, which you can see in the notebook.
Convolutional Neural Network for sentiment analysis
A CNN model is a type of neural architecture that is based on learned matrices of numbers (filters) that slide (convolve) over the input data. These filters cover a “region” of the input as they move across it, which means their output includes context. That’s useful with images, where a particular filter might highlight “edges” based on how it weights changes in pixel intensity (e.g. where a black line interrupts a white background). But this contextualized representation is also useful for text, as discussed above.
The first approach we’ll tackle makes use of this architecture as implemented in spaCy. A rough outline of the implementation is below; a deeper explanation of the architecture is given in this video:
Text is converted to embeddings, which are then fed into a four-layer CNN. Each layer scans over the previous layer’s output, including context on either side of each token. Since each layer takes token-level representations from either side of a “central” token, by the fourth layer each token representation includes some amount of context from four tokens on either side of it. This whole flow is referred to as the “tok2vec” component of the pipeline.
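To see how stacking windowed layers widens context, here’s a toy, pure-Python sketch (not spaCy’s actual implementation) that just tracks which token positions feed into each representation:

```python
# Toy illustration of how stacked convolution layers widen context.
# Each "layer" mixes a token's representation with its immediate neighbors,
# so after k layers a token's representation depends on k tokens per side.

def conv_layer(reps):
    """One window pass: each position absorbs its neighbors' indices."""
    out = []
    for i in range(len(reps)):
        window = set()
        for j in (i - 1, i, i + 1):
            if 0 <= j < len(reps):
                window |= reps[j]
        out.append(window)
    return out

tokens = ["I", "wish", "I", "was", "doing", "well"]
reps = [{i} for i in range(len(tokens))]  # each rep starts as its own index

for _ in range(4):  # four layers, matching the CNN described above
    reps = conv_layer(reps)

# After four layers, a token "sees" four tokens on either side
# (clipped at the sentence boundaries).
print(sorted(reps[3]))  # -> [0, 1, 2, 3, 4, 5]
```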
For the actual document-level categorization, this “contextualized representation” is then mean-aggregated and passed to a classification layer that predicts the category. In our sentiment analysis example, the two categories are “positive” and “negative”.
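As a rough sketch of that classification head (with made-up 2-D token vectors and hypothetical weights, nothing like the trained model’s):

```python
import math

# Toy contextualized token vectors for a three-token document.
token_vecs = [[0.2, 0.8], [0.5, 0.1], [0.9, 0.4]]

# Mean aggregation across tokens -> one document vector.
doc_vec = [sum(dim) / len(token_vecs) for dim in zip(*token_vecs)]

# Hypothetical learned weights for the two classes (rows: positive, negative),
# followed by a softmax to turn logits into category scores.
W = [[1.0, -1.0], [-1.0, 1.0]]
logits = [sum(w * x for w, x in zip(row, doc_vec)) for row in W]
exps = [math.exp(l) for l in logits]
probs = [e / sum(exps) for e in exps]

print({"positive": round(probs[0], 3), "negative": round(probs[1], 3)})
```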
You can see the implementation of this in the configuration file on GitHub (abridged below):
Note: This is simplified in more recent versions of spaCy.
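For orientation, here is an abridged, illustrative sketch of what the relevant sections look like in a spaCy v3-style config (values here are indicative; see the repository for the real file):

```ini
[nlp]
lang = "en"
pipeline = ["textcat"]

[components.textcat]
factory = "textcat"

[components.textcat.model]
@architectures = "spacy.TextCatCNN.v2"
exclusive_classes = true

[components.textcat.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"

[components.textcat.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 96
attrs = ["NORM"]
rows = [5000]
include_static_vectors = false

[components.textcat.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3
```

Note that `depth = 4` and `window_size = 1` correspond to the four layers and one-token-per-side window described above.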
We can then train the model using spaCy’s training workflow. By default, training outputs the model loss and accuracy on the validation set as the model learns patterns in the data.
Once training is finished, we can use the model like any other spaCy pipeline, except now the “.cats” attribute is populated with the output of our CNN model. A simple evaluation workflow shows that the default model does well, achieving 83% accuracy. We could spend a lot of time tweaking parameters here, but since everyone seems so hot and bothered about Transformer models these days, why don’t we jump into that?
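Reading a prediction out of a `doc.cats`-style score dict amounts to taking the highest-scoring label, and accuracy is just the fraction of those picks that match the gold labels. A small sketch with made-up model outputs (not the real evaluation numbers):

```python
# Each prediction mimics the shape of doc.cats: a dict of label scores.

def predict_label(cats):
    """Pick the higher-scoring category, mirroring how we read doc.cats."""
    return max(cats, key=cats.get)

# Hypothetical model outputs on four held-out reviews.
predictions = [
    {"positive": 0.91, "negative": 0.09},
    {"positive": 0.30, "negative": 0.70},
    {"positive": 0.55, "negative": 0.45},
    {"positive": 0.20, "negative": 0.80},
]
gold = ["positive", "negative", "negative", "negative"]

correct = sum(predict_label(p) == g for p, g in zip(predictions, gold))
print(correct / len(gold))  # -> 0.75
```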
Transformers for sentiment analysis (Alt title: Autobots Roll Out!)
Sorry – I couldn’t resist. One weakness of many neural models for language is handling long-range dependencies. Think of the phrase “The movie I watched today was really bad”. “Movie” and “bad” are separated by 5 tokens (assuming a simple whitespace split), which means that even with the context included in our CNN model, it’s unlikely that the representation for “movie” will include any information from “bad”.
If you think about how you, a very complex NLP system, would process that sentence, you’d know that “bad” refers to the movie. You naturally link those two words, just as you link “the” and “movie”. This is (roughly) the idea behind a mechanism called “attention”: each token’s representation includes its relationship to the other tokens.
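A toy sketch of the mechanism, using 1-D “embeddings” and a plain dot product (real transformers use learned, high-dimensional projections, but the shape of the computation is the same):

```python
import math

def softmax(xs):
    """Normalize scores into weights that sum to 1."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(embeddings):
    """Each token's output is a weighted mix of ALL tokens' values,
    with weights based on similarity -- distance doesn't matter."""
    out = []
    for q in embeddings:
        weights = softmax([q * k for k in embeddings])
        out.append(sum(w * v for w, v in zip(weights, embeddings)))
    return out

# Toy scores for the 8 whitespace tokens of our example sentence;
# "movie" (index 1) and "bad" (index 7) interact directly here.
embs = [0.1, 0.9, 0.1, 0.2, 0.1, 0.1, 0.2, 0.8]
mixed = attend(embs)

print(len(mixed) == len(embs))  # one contextualized value per token
```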
Transformer models rely on this mechanism to overcome (in some situations) the limits of other neural models. There’s a lot more to these models, but we’ll save that discussion for another time. Beyond just the architecture improvements, these models have been used to great effect in transfer learning. By training a transformer model to predict a word given the word’s context, it internalizes a lot of general language patterns (though it does not necessarily “understand” them).
Thanks to all the hard-working contributors to HuggingFace’s transformers library and spaCy, we get to leverage these “pre-trained” models for our adorable little project. With some minor modifications to the CNN config file, we pull in a Transformer model (DistilBERT) and use it as our “tok2vec” component. The next step is roughly the same: condensing the input to a document representation and passing that through a classification layer. Training also proceeds in roughly the same way, though it will likely take longer (and eat up more compute).
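Illustratively, the swap amounts to replacing the CNN’s tok2vec sections with transformer ones. An abridged, indicative sketch (exact architecture names and versions vary across spacy-transformers releases; see the repository for the real config):

```ini
[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "distilbert-base-uncased"

[components.textcat.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"

[components.textcat.model.tok2vec.pooling]
@layers = "reduce_mean.v1"
```

The listener lets the textcat component reuse the transformer’s output as its token vectors, and the mean pooling plays the same aggregation role as before.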
By the end, we see a 6% improvement in performance. Remember, this is using mostly default parameters. It would not be surprising if you could push performance even higher. I’d be interested to hear if you do!
All that chatters is not gold
I always find myself thinking about this visual from Google’s NeurIPS paper “Hidden Technical Debt in Machine Learning Systems”:
The key thing I see here is how small a fraction of the entire system the model is. Looking at the changes we made to the spaCy configuration to leap forward roughly a decade in NLP development brings that point home. Adding complexity to a pipeline is pretty easy; accounting for that complexity in a production system is the stuff of MLOps nightmares.
Some of the bagging techniques in the previous blog post achieved the same performance as this transformer-based approach (though they used MANY features). Do a few extra percentage points of performance merit the additional complexity? In some cases, yes. But I’d encourage folks to ask that question before they jump to the newest and shiniest of approaches. Transformers are fascinating and powerful technologies. But, personally, I wouldn’t start with them.
If you liked this and want to dive in deeper, join me and others at ODSC East in May! I welcome your comments on this work and anything I’ve discussed. Looking forward to May and seeing you all soon.
About the author/ODSC East 2023 speaker:
Ben is a Senior Data Scientist at the Institute for Experiential AI at Northeastern University. He obtained his Master of Public Health (MPH) from Johns Hopkins and his PhD in Policy Analysis from the Pardee RAND Graduate School. Since 2014, he has been working in data science for government, academia, and the private sector. His major focus has been on Natural Language Processing (NLP) technology and applications. Throughout his career, he has pursued opportunities to contribute to the larger data science community. He has presented his work at conferences, published articles, taught courses in data science and NLP, and is a co-organizer of the Boston chapter of PyData. He also contributes to volunteer projects applying data science tools for public good.