Intro to Language Processing with the NLTK

Hidden information often lies beyond the boundaries of what we can perceive with our eyes and ears. Some look to data for that purpose, and most of the time, data can tell us more than we thought imaginable. But sometimes the data isn't clear-cut enough for conventional analytics. So what do you do when you're at a standstill? If you have more text-rich data than you could ever read through, natural language processing can distill all that text into simple insights.

Language, tone, and sentence structure can explain a lot about how people are feeling. Combined with machine learning, the Natural Language Toolkit (NLTK), a Python library for analyzing text, can even help predict how people might feel about similar topics. For our purposes, we'll work on a single body of text, cleaning and analyzing key parts of past presidents' inaugural speeches, which are included in NLTK's corpus library. Once you have the basics, applying these techniques in a machine learning classifier should be an easy task you can do with just about any text-rich data. Here's how to get started.

As always, we start by installing and importing the proper packages for our project. Here’s the list of libraries I used in my notebook:

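The exact list from my notebook isn't shown here, but a plausible set for the steps in this walkthrough (assuming NLTK's inaugural and stopwords corpora, the WordNet lemmatizer, and matplotlib) would be:

```python
import nltk
import matplotlib.pyplot as plt

from nltk.corpus import inaugural, stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads of the corpora and models used below.
nltk.download("inaugural", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)
```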
Next, we’ll download the inaugural speech data from NLTK’s corpus library. The speech I’ll be analyzing is Obama’s from 2009.

When working with text files using NLTK, it's essential to separate, or tokenize, each word in the document. Luckily, NLTK's corpus library has built-in calls to tokenize files, so all we'll need to do is specify the exact speech we want to explore.

Another important step is to remove stop words from the data. Stop words are some of the most common English words, like and, or, are, and am. These words aren't very helpful in examining the language used in the speech, so it's best to do away with them.

We can now start looking at our data visually with the help of the matplotlib library. If you're unfamiliar with matplotlib, it's a fairly simple tool that lets you generate charts from raw data in Python. Its website has several tutorials if you'd like to toy around with data visualization.


That looks pretty good, but I think we can do a little more cleaning. We need to simplify our data even further so that any machine learning algorithm we later apply can learn from it more easily. This process is called normalization, and it becomes important when working with even larger sets of data. For our purposes, we'll just lemmatize the words in Obama's speech, reducing each word to its base form.

And the outcome should look like the same list of words from the speech, with each word reduced to its base form.

Great! Now we have the cleanup tools necessary to work on data using the Natural Language Toolkit. We can use these packages on larger sets of data to perform tasks like sentiment analysis. These tricks can be helpful when looking into largely inconsistent data, like comments in a YouTube thread, and can help us understand how people react to things on a large scale.

Editor’s note: Want to learn more about NLP in-person? Attend ODSC East 2020 in Boston this April 13-17 and learn from the experts directly!

Kailen Santos

I’m a freelance data journalist based in Boston, MA. Formally trained in both data science and journalism at Boston University, I aspire to make working with data easy and fun. If you work in a newsroom or if you’re just data-curious, I hope to help you explore data clearly. https://www.kailenjsantos.wordpress.com/