Editor’s note: Ben is a speaker for ODSC West this November 1st-3rd. Be sure to check out his talk, “Bagging to BERT – A Tour of Applied NLP,” there!
Every two days, we generate as much data as was produced from the start of human history to 2003. These data tend to be in unstructured formats like images, text, and video, accounting for roughly 70% of stored digital data according to some sources. However, data is not the same thing as information. In order for these data to support modern AI use cases, they need to be processed carefully and thoughtfully.
This is, in my view, the major goal of Natural Language Processing (NLP). NLP provides a set of techniques for turning raw text data into useful information. The field has undergone a renaissance of sorts over the past decade, fueled by the spread of open-source tooling (e.g. spaCy), public text datasets (e.g. Common Crawl), and new architectures (e.g. BERT). This has led to large improvements across NLP applications such as machine translation, where Google reports gains across more than 100 languages in the past several years.
Though the state of the art in NLP is based on large neural architectures, a lot can be accomplished with simple techniques like weighted word counts and topic models. One of my favorite examples of this comes from the book Speech and Language Processing. The authors present a visual of the counts of words in Shakespeare’s plays:
(adapted from chapter 6: Vector Semantics and Embeddings)
“Twelfth Night,” one of Shakespeare’s comedies, contains frequent use of words like “wit” and “fool”, while “Julius Caesar”, one of Shakespeare’s tragedies, uses words like “battle”. This is a simple example, but this “separability” of the words used could be a powerful set of features for a model trying to classify a text as “comedy” or “tragedy.”
In my upcoming tutorial at ODSC West, I’ll walk through an example sentiment analysis use-case, starting with simple methods like word counts and building to more advanced techniques such as transformer models. In this post, I’ll demonstrate how simple word count and weighted word count techniques can achieve impressive performance on a sentiment analysis task.
For this exercise (and for the tutorial) we will use a collection of 50k IMDB reviews, each labeled as either positive or negative. These data are reasonably clean and easy to obtain, and the binary task of classifying a review as positive or negative is fairly straightforward. It should be noted that real-world problems and datasets are rarely this clean, easy to obtain, or straightforward.
Additionally, I do some manipulation of this dataset. The code and data are available in the GitHub repo, and you can follow along with the code snippets below in Google Colab.
Word count vectors
One first step we can take with these data is to count the words contained within each review. As shown above, even a visual inspection of these counts can yield useful insights. This can be accomplished using a Counter object from Python’s built-in collections module:
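A minimal sketch of that approach (the review text here is a made-up stand-in for one of the IMDB reviews):

```python
from collections import Counter

# A toy review standing in for one of the IMDB reviews
review = "this movie was great the acting was great and the plot was good"

# Split on whitespace and count -- no lowercasing or punctuation handling
word_counts = Counter(review.split())
print(word_counts.most_common(3))  # [('was', 3), ('great', 2), ('the', 2)]
```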
You’ll notice that this approach doesn’t apply any preprocessing such as lowercasing; it is just a count of a list of words split on whitespace. Scikit-learn’s CountVectorizer is more performant and allows more flexibility in how words are split and counted. In this example, we use the default preprocessing and add the removal of stop words like “but” and “the”:
To turn this back into something easier to inspect, we can get the words (here called “features”) out of the vectorizer:
By applying this vectorizer to the entire dataset, we get what can be called a “document-term matrix”; that is, a matrix with each row representing a document and each column representing the number of times a word appears within that document.
Using this matrix, we can create a simple rule-based word-scoring approach to sentiment analysis. We construct a list of words that we score as either positive or negative, and give each document a score based on the counts of those words. The result is a document-level score we can use for classification, with the mean score as an arbitrary cut-off: documents scoring above the mean are marked as positive reviews, and those below it as negative. Below, we look at how that performs on a holdout set (30% of the data).
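A sketch of such a rule-based scorer; the word lists here are short, hypothetical examples (the real lists would be longer):

```python
import numpy as np

# Hypothetical hand-picked sentiment words -- purely illustrative
positive_words = {"good", "great", "excellent", "best", "love"}
negative_words = {"bad", "terrible", "worst", "boring", "awful"}

def score_review(text):
    """Sum +1 for each positive word and -1 for each negative word."""
    score = 0
    for word in text.lower().split():
        if word in positive_words:
            score += 1
        elif word in negative_words:
            score -= 1
    return score

reviews = ["a great movie with a great cast", "boring plot and terrible acting"]
scores = np.array([score_review(r) for r in reviews])

# Use the mean score as the cut-off between positive and negative predictions
predictions = scores > scores.mean()
print(scores, predictions)  # [ 2 -2] [ True False]
```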
Classification Report for a deterministic approach
This didn’t do great, but it took seconds to run and required no model development or training. What if we instead plug this set of word count “features” into a simple classification model like a logistic regression? In this case, rather than using a rule-based method, we’re asking the model to learn the relationship between word counts and whether a review is positive or negative from a subset of the data. We then assess performance on the same holdout set:
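A sketch of that setup, with toy reviews and labels standing in for the IMDB data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy stand-ins for the IMDB reviews and their labels (1 = positive)
texts = ["great movie", "terrible movie", "great cast", "terrible plot"] * 10
labels = [1, 0, 1, 0] * 10

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42
)

# Count vectors feeding a logistic regression classifier
model = Pipeline([
    ("counts", CountVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```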
Classification Report for Count Vector-based model
Much improved! But can we do better? One thing we see if we look at the document-term matrix is that every word is counted the same way. Consider a few simplistic example movie reviews: we can already tell which words are most relevant to the specific content of each review (e.g. “good”, “bad”, “great”).
We see here that in these reviews the more informative words are being counted the same as the less informative words. We might want to use a weighting scheme to ensure that words that are more informative about the content are flagged as more important.
Term Frequency – Inverse Document Frequency (TF-IDF)
TF-IDF is one such weighting scheme. The idea here is that word counts are weighted by how often the word occurs across a set of documents. Words like “the” occur often (high document frequency) while a word like “bad” occurs less often (low document frequency). The inverse of this document frequency will down-weight common words and up-weight uncommon words. You can see how that works with the example above:
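To make the weighting concrete, here is a small sketch of the smoothed inverse document frequency formula that scikit-learn’s TfidfVectorizer uses by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, on a toy corpus:

```python
import math

# Toy corpus: "the" appears in every document, "bad" in only one
docs = [
    "the movie was bad",
    "the plot was thin",
    "the cast was fine",
]
n = len(docs)

def idf(term):
    """Smoothed inverse document frequency (scikit-learn's default formula)."""
    df = sum(term in doc.split() for doc in docs)
    return math.log((1 + n) / (1 + df)) + 1

print(round(idf("the"), 3))  # 1.0   -- common word, low weight
print(round(idf("bad"), 3))  # 1.693 -- rarer word, higher weight
```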
You can see here that the informative words (“good”, “bad” and “pretty”) have higher weights than the other words. This may provide more information than raw word counts to a classification model. Let’s try it in our Logistic Regression.
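Swapping the vectorizer is a one-line change; a sketch with the same toy data as before:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-ins for the IMDB reviews and their labels (1 = positive)
texts = ["great movie", "terrible movie", "great cast", "terrible plot"] * 10
labels = [1, 0, 1, 0] * 10

# Replace CountVectorizer with TfidfVectorizer; the classifier is unchanged
model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(texts, labels)
print(model.score(texts, labels))
```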
Classification Report for TF-IDF-based model
We see some minor improvements here, and we can look at a couple of examples to see what might be changing in the models’ predictions.
On this review, the TF-IDF model wrongly predicted negative, while the count model correctly predicted positive. Generally, TF-IDF seems to weigh the word “bad” as stronger evidence of a negative review, which is likely useful in most cases:
“Death Wish 3 is exactly what a bad movie should be. Terrible acting! Implausible scenerios! Ridiculous death scenes! Creepy, evil-for-no-reason villains!”
On this, TF-IDF correctly predicted negative, while the count model wrongly predicted positive. Again, an instance of words like “bad”, but in this case, it really was bad:
“I rented the DVD in a video store, as an alternative to reading the report. But it’s pretty much just more terror-tainment. While the film may present some info from the report in the drama, you’re taking the word of the producers – there’s no reference to the commission report anywhere in the film. Not one. The acting, all around, is pretty bad…”
These are cherry-picked examples, but they give some sense of the difference between the two representations.
Both of these methods treat each document as a “bag” of words. The counts are largely context-free (though TF-IDF does account for document and corpus characteristics). But we, as expert NLP systems, know that the meaning of words changes with context. A cute illustration of that idea:
The importance of context
“Well” can be how someone is feeling or a device for getting water; it all depends on context.
The third approach I want to introduce here is “word embeddings”, where a model is trained on a large, general-purpose corpus to create a word-level representation that incorporates information about the word’s context. The dimensions of these representations don’t have readily interpretable meaning, but taken together they provide useful general-purpose language information. One of the prime examples is below, where by subtracting the representation of “man” from the representation of “king” and adding the representation of “woman”, you get back (nearly) the representation of “queen.”
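With a real model you would use pretrained vectors (e.g. via gensim or spaCy), but the arithmetic itself can be sketched with made-up toy vectors; the dimensions and values below are purely illustrative:

```python
import numpy as np

# Made-up 3-d "embeddings" -- purely illustrative, not from a trained model
vectors = {
    "king":  np.array([0.9,  0.7, 0.1]),
    "queen": np.array([0.9, -0.7, 0.1]),
    "man":   np.array([0.1,  0.7, 0.1]),
    "woman": np.array([0.1, -0.7, 0.1]),
}

def nearest(target, exclude):
    """Return the vocabulary word whose vector is closest to target (cosine)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vectors if w not in exclude),
               key=lambda w: cos(vectors[w], target))

# king - man + woman ~= queen
result = nearest(vectors["king"] - vectors["man"] + vectors["woman"],
                 exclude={"king", "man", "woman"})
print(result)  # queen
```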
Algebra with word embeddings
This reflects what we’d call conceptual understanding, though this should be interpreted with caution.
So what happens if we use these word embeddings and create a document-level representation based on the count-weighted average? Can this improve our sentiment model?
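A sketch of building such a document-level representation, again with toy vectors standing in for pretrained embeddings:

```python
import numpy as np

# Toy 2-d word vectors standing in for pretrained embeddings
embeddings = {
    "great": np.array([1.0, 0.2]),
    "bad":   np.array([-1.0, 0.1]),
    "movie": np.array([0.0, 1.0]),
}

def document_vector(text):
    """Average the embeddings of known words; repeated words count repeatedly,
    which makes this a count-weighted average."""
    words = [w for w in text.lower().split() if w in embeddings]
    if not words:
        return np.zeros(2)
    return np.mean([embeddings[w] for w in words], axis=0)

doc = document_vector("great great movie")
print(doc)  # average of two "great" vectors and one "movie" vector
```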
Classification Report for word embedding-based model
The answer seems to be…no. But don’t give up! In my ODSC tutorial, we’ll continue with these and other approaches to build a system that can approach the state of the art using freely available, open-source tools! Join me in November!
About the author/ODSC West 2022 Speaker:
Benjamin Batorsky is a Senior Data Scientist at the Institute for Experiential AI. He obtained his Master’s in Public Health (MPH) from Johns Hopkins and his PhD in Policy Analysis from the Pardee RAND Graduate School. Since 2014, he has been working in data science for the government, academia, and the private sector. His major focus has been on Natural Language Processing (NLP) technology and applications. Throughout his career, he has pursued opportunities to contribute to the larger data science community. He has spoken at data science conferences, taught courses in data science, and helped organize the Boston chapter of PyData. He also contributes to volunteer projects applying data science tools for public good.