Word clouds are a useful visualization tool. They show the most frequent words in a text, where the relative size of the word correlates with frequency.
This is an example word cloud:
Word clouds are useful for at least two purposes:
- An initial exploration of text to discover which words are most numerous. While this can be achieved by printing a list of words in descending order by frequency, a word cloud will create an easier-to-read visual representation.
- Visualizing the output of a topic model: a word cloud for each topic makes the most representative words of that topic evident.
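For comparison, the plain frequency listing mentioned in the first point can be sketched with the standard library alone. This is a minimal illustration rather than code from the recipe: the top_words helper, the regex tokenizer, and the sample sentence are all assumptions made for the example.

```python
import re
from collections import Counter


def top_words(text, n=5):
    """Return the n most frequent lowercase words as (word, count) pairs."""
    # Crude tokenizer (an assumption for this sketch): runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)


sample = "The dog saw the cat. The cat saw the bird."
for word, count in top_words(sample, n=3):
    print(f"{word}: {count}")
```

A listing like this is accurate but gets hard to scan once it runs to hundreds of words, which is where the word cloud helps.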
Let’s create a word cloud based on the book Sherlock Holmes by Arthur Conan Doyle. You can download the file with the text of the book on GitHub: https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook/blob/master/Chapter01/sherlock_holmes.txt.
First, import the necessary packages:
import os
import nltk
from os import path
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS
from nltk.probability import FreqDist
from PIL import Image
import numpy as np
Now read in the text file and lowercase it:
text_file = "sherlock_holmes.txt"  # Modify this path accordingly
with open(text_file, "r", encoding="utf-8") as f:
    text = f.read()
text = text.lower()
We need to remove the stopwords from the text, as otherwise the most prominent words will be words like I, he, the, etc. There are different ways of removing stopwords, including compiling a list by hand or removing the most frequent words. One way is to write a function that collects the most frequent 2% of the distinct words in a text, which we will later use as the stopwords list:
def compile_stopwords_list_frequency(text, freq_percentage=0.02):
    # Tokenize the text and count how often each lowercased word occurs
    words = nltk.tokenize.word_tokenize(text)
    freq_dist = FreqDist(word.lower() for word in words)
    words_with_frequencies = [(word, freq_dist[word]) for word in freq_dist.keys()]
    # Sort by frequency (the second tuple element), least frequent first
    sorted_words = sorted(words_with_frequencies, key=lambda tup: tup[1])
    length_cutoff = int(freq_percentage * len(sorted_words))
    # Keep only the words (the first tuple element) from the most frequent end
    stopwords = [tup[0] for tup in sorted_words[-length_cutoff:]]
    return stopwords
The function takes in two arguments: the text and the percentage that will be used for cutoff, which defaults to 2%. First, the function tokenizes the text into words and then it creates a FreqDist object. This object contains the word frequency information about the text. In the next line we get a list of tuples, where the first element is the word, and the second one is its frequency. We then sort the list by frequency, calculate the length cutoff using the percentage and get the stopwords list using this parameter.
Next, use this function to create the stopwords list:
stopwords = compile_stopwords_list_frequency(text)
stopwords.remove("holmes")
stopwords.remove("watson")
We remove the words holmes and watson from the list because, although they are frequent, they are not stopwords.
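To make the cutoff arithmetic in compile_stopwords_list_frequency concrete, here is a small stand-alone sketch of the same idea on a toy vocabulary. It uses collections.Counter in place of NLTK's FreqDist and a whitespace split in place of word_tokenize. Both substitutions, and the helper name most_frequent_fraction, are assumptions made so the example runs without extra downloads.

```python
from collections import Counter


def most_frequent_fraction(tokens, fraction=0.02):
    """Return the top `fraction` of distinct tokens, ranked by frequency."""
    counts = Counter(tokens)
    # Number of distinct words to keep, rounded down.
    cutoff = int(fraction * len(counts))
    # most_common(0) returns an empty list, so a tiny vocabulary yields no stopwords.
    return [word for word, _ in counts.most_common(cutoff)]


tokens = "the a the b the c a d e f g h i".split()
# 10 distinct words; a 20% cutoff keeps int(0.2 * 10) = 2 of them.
print(most_frequent_fraction(tokens, fraction=0.2))
```

One edge case is worth knowing about. With fewer than 50 distinct words and the default 2%, the cutoff is 0, and a slice such as sorted_words[-0:] returns the whole list rather than an empty one. A full-length book has a large enough vocabulary that this does not arise.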
Now create the word cloud:
output_filename = "odsc_wordcloud.png"
wordcloud = WordCloud(min_font_size=10, max_font_size=100,
                      stopwords=stopwords, width=1000,
                      height=1000, max_words=1000,
                      background_color="white").generate(text)
wordcloud.to_file(output_filename)
You can adjust the parameters passed to the WordCloud object to change the size of the picture, the minimum and maximum font sizes, and the colors.
Use the following code to display the image while the program is running:
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
The resulting image will look something like this (it changes from run to run):
This word cloud still contains some stopwords (said, off, without), and you can experiment with modifying the stopwords list to get a cleaner result. In any case, you can see that the book talks about a woman, paper, police, business, money.
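One way to run that experiment is to merge a few hand-picked words into the frequency-based list before building the cloud. The sketch below is only an illustration: the helper name extend_stopwords and the sample list are assumptions, and the extra words are the three leftovers named above.

```python
def extend_stopwords(stopwords, extra_words):
    """Return a new sorted list with both groups of words, lowercased and de-duplicated."""
    merged = {word.lower() for word in stopwords}
    merged.update(word.lower() for word in extra_words)
    return sorted(merged)


# Stand-in for the list produced earlier (hypothetical sample values).
frequency_stopwords = ["the", "and", "of", "said"]
cleaned = extend_stopwords(frequency_stopwords, ["said", "off", "without"])
print(cleaned)
```

The merged list can then be passed to WordCloud through the same stopwords argument used earlier.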
You can also see some phrases in the word cloud, such as Sherlock Holmes, said Holmes, of course, and others. Remove these by setting the collocations parameter to False when creating the WordCloud object:
wordcloud = WordCloud(min_font_size=10, max_font_size=100,
                      stopwords=stopwords, width=1000,
                      height=1000, max_words=1000,
                      background_color="white",
                      collocations=False).generate(text)
Finally, you can apply a shape to the cloud image. We will use the following shape:
Read in the image and generate the word cloud using it as a mask:
output_filename = "odsc_wordcloud_mask.png"
sherlock_data = Image.open("sherlock.png")
sherlock_mask = np.array(sherlock_data)
wordcloud = WordCloud(background_color="white", max_words=2000,
                      mask=sherlock_mask, stopwords=stopwords,
                      min_font_size=10, max_font_size=100)
wordcloud.generate(text)
wordcloud.to_file(output_filename)
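To see what the mask array actually encodes, it can help to build a simple one by hand. The sketch below draws a filled circle with NumPy. It relies on the wordcloud convention that pure white (255) pixels are masked out while all other pixels are available for words. The helper name circle_mask and the sizes are assumptions chosen for illustration.

```python
import numpy as np


def circle_mask(size=300, radius=130):
    """Return a size x size uint8 array: 0 inside the circle, 255 (masked out) outside."""
    center = size // 2
    y, x = np.ogrid[:size, :size]
    outside = (x - center) ** 2 + (y - center) ** 2 > radius ** 2
    return (255 * outside).astype(np.uint8)


mask = circle_mask()
print(mask.shape, mask[150, 150], mask[0, 0])
```

The same rule applies to sherlock_mask: words are drawn only where the source image is not pure white.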
The result will look approximately like this:
While word clouds are useful for visualizing text data, topic models are a more formal tool to analyze topics in a text. I will discuss topic models at the tutorial Introduction to NLP and Topic Modeling at the ODSC West conference (https://odsc.com/speakers/introduction-to-nlp-and-topic-modeling/).
More code recipes can be found in my book, Python Natural Language Processing Cookbook: https://www.amazon.com/Python-Natural-Language-Processing-Cookbook/dp/1838987312/.
About the author/ODSC West 2021 speaker:
Zhenya Antić is an NLP consultant and founder of Practical Linguistics Inc. Her projects include document summarization, information extraction, topic modeling and sentiment analysis of consumer reviews, and document similarity. Zhenya holds a PhD in Linguistics from the University of California Berkeley and a BS in Computer Science from the Massachusetts Institute of Technology.
Zhenya is the author of the Python Natural Language Processing Cookbook. Packt is giving out print books to 3 randomly chosen participants of the Introduction to NLP and Topic Modeling workshop on November 16.