You Must Allow Me To Tell You How Ardently I Admire and Love Natural Language Processing
Posted by Julia Silge, May 10, 2017
It is a truth universally acknowledged that sentiment analysis is super fun, and Pride and Prejudice is probably my very favorite book in all of literature, so let’s do some Jane Austen natural language processing.
Project Gutenberg makes e-texts available for many, many books, including Pride and Prejudice, which is available here. I am using the plain text UTF-8 file available at that link for this analysis. Let’s read the file and get it ready for analysis.
Munge the Data, But ELEGANTLY, As Would Befit Jane Austen
The plain text file has lines that are just over 70 characters long. We can read them in using the readr library, which is super fast and easy to use. Let’s use the n_max options to leave out the Project Gutenberg header and footer information and just get the actual text of the novel. Lines of 70 characters are not really a big enough chunk of text to be useful for my purposes here (that’s not even a tweet!) so let’s use stringr to concatenate these lines in chunks of 10. That gives us sort of paragraph-sized chunks of text.
Maybe you don’t think for loops are elegant, actually, but I could not come up with a way to vectorize this.
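For anyone who wants to try the chunking step without extra packages, here is a minimal sketch in base R. It is not the code from this post: it swaps readr and stringr for readLines() and paste(), uses a tiny made-up file instead of the Gutenberg text, and the header and line counts are hypothetical values. It also shows one possible loop-free way to group lines in tens, by giving each line a chunk id and pasting within each id.

```r
# Toy stand-in for the Project Gutenberg file:
# two fake header lines, 25 "novel" lines, one fake footer line.
toy_lines <- c("HEADER 1", "HEADER 2", paste("novel line", 1:25), "FOOTER 1")
path <- tempfile(fileext = ".txt")
writeLines(toy_lines, path)

# Hypothetical counts; the real file needs its own values.
header_lines <- 2
novel_lines <- 25

# Base R version of "skip the header, stop before the footer".
# (readr::read_lines() has skip and n_max arguments that play the same role.)
raw_text <- readLines(path)[(header_lines + 1):(header_lines + novel_lines)]

# Give every line a chunk id (1 for lines 1-10, 2 for lines 11-20, ...),
# then paste the lines of each chunk together with spaces.
chunk_size <- 10
chunk_id <- ceiling(seq_along(raw_text) / chunk_size)
chunks <- unname(vapply(split(raw_text, chunk_id), paste, character(1), collapse = " "))

text_df <- data.frame(text = chunks, stringsAsFactors = FALSE)
print(nrow(text_df))
```

The last chunk simply ends up shorter when the number of lines is not a multiple of 10.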
Mr. Darcy Delivered His Sentiments in a Manner Little Suited to Recommend Them
To do the sentiment analysis, let’s use the NRC Word-Emotion Association Lexicon of Saif Mohammad and Peter Turney. You can read a bit more about the NRC sentiment dictionary and how it is used in one of my previous blog posts. It is implemented in R in the
I was not sure, when I stopped to think about it, exactly how appropriate this tool is for analyzing 200-year-old text. Language changes over time and from what I can tell, the NRC lexicon is designed and validated to measure the sentiment in contemporary English. It was created via crowdsourcing on Amazon’s Mechanical Turk. However, it doesn’t seem to do badly on Jane Austen’s prose; the sentiment results are about what one would expect compared to a human reading of the meaning. If anything, the text in Pride and Prejudice involves more dramatic vocabulary than a lot of contemporary English prose and it is easier for a tool like the NRC dictionary to pick up on the emotions involved.
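To make the idea of a word-emotion lexicon concrete, here is a small sketch in base R of how a dictionary lookup like this can work. The lexicon entries and the score_text() helper are invented for illustration; they are not entries from the real NRC lexicon and not any package's API.

```r
# Toy word-emotion lexicon: one row per (word, category) association.
# These rows are made up for illustration, NOT real NRC entries.
toy_lexicon <- data.frame(
  word = c("love", "love", "happy", "happy",
           "danger", "danger", "misery", "misery"),
  category = c("joy", "positive", "joy", "positive",
               "fear", "negative", "sadness", "negative"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count how many tokens in `text` are associated
# with each category in `lexicon`.
score_text <- function(text, lexicon) {
  # Lowercase, then split on anything that is not a letter or apostrophe
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  # Collect the categories of every token, so repeated words count twice
  hits <- unlist(lapply(words, function(w) lexicon$category[lexicon$word == w]))
  categories <- sort(unique(lexicon$category))
  vapply(categories, function(cat) sum(hits == cat), integer(1))
}

scores <- score_text("I love you, but there is danger and misery in love.", toy_lexicon)
print(scores)
```

A lookup like this knows nothing about context or negation, which is part of why it is worth asking how well a modern lexicon fits older prose.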
Let’s look at some examples.
Oh, Mrs. Bennet…
So let’s start from a working hypothesis that the NRC lexicon can be applied to this novel and do the sentiment analysis for each chunk of text in our dataframe. At the same time, let’s make a linenumber that counts up through the novel.
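Here is a rough sketch of that step in base R, scoring every chunk and adding a running linenumber. The three chunks, the two tiny word lists, and the count_matches() helper are all made up for illustration; a real analysis would use the full lexicon rather than these stand-ins.

```r
# Three made-up chunks standing in for the paragraph-sized chunks of text
text_df <- data.frame(
  text = c("a happy and agreeable evening",
           "a wretched and miserable affair",
           "happy news after a miserable week"),
  stringsAsFactors = FALSE
)

# Tiny invented word lists; a real lexicon is far larger
positive_words <- c("happy", "agreeable")
negative_words <- c("wretched", "miserable")

# Hypothetical helper: number of tokens in `text` found in `vocabulary`
count_matches <- function(text, vocabulary) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  sum(words %in% vocabulary)
}

# One row of scores per chunk, plus a counter that runs through the text
sentiment_df <- data.frame(
  linenumber = seq_len(nrow(text_df)),
  positive = vapply(text_df$text, count_matches, integer(1),
                    vocabulary = positive_words, USE.NAMES = FALSE),
  negative = vapply(text_df$text, count_matches, integer(1),
                    vocabulary = negative_words, USE.NAMES = FALSE)
)
print(sentiment_df)
```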
Dividing Up the Volumes
Pride and Prejudice contains 61 chapters divided into three volumes; Volume I is Chapters 1-23, Volume II is Chapters 24-42, and Volume III is Chapters 43-61. Let’s find where these breaks between volumes have ended up.
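One way to locate those breaks, sketched in base R, is to search the chunks for the first chapter heading of Volumes II and III. This assumes the headings appear in the text as "Chapter 24" and "Chapter 43", which may not match every edition of the file, and the chunks below are toy placeholders rather than text from the novel.

```r
# Toy chunks; only the chapter headings matter for this sketch
toy_chunks <- c(
  "Chapter 1 opening of the novel",
  "more of Volume I",
  "end of Volume I Chapter 24 start of Volume II",
  "more of Volume II",
  "Chapter 43 start of Volume III",
  "closing chunk"
)

# First chapter of each later volume, from the chapter ranges above
first_chapters <- c(volume2 = 24, volume3 = 43)

# For each heading, return the index of the first chunk that contains it.
# The \\b word boundaries keep "Chapter 2" from matching inside "Chapter 24".
volume_starts <- vapply(first_chapters, function(ch) {
  grep(paste0("\\bChapter ", ch, "\\b"), toy_chunks)[1]
}, integer(1))
print(volume_starts)
```

Because the lines were pasted together in tens, a volume break can land in the middle of a chunk, as it does in the third toy chunk here.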
Let’s make a volume factor for the dataframe and then restart the linenumber count at the beginning of each volume.
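As a minimal sketch of that bookkeeping in base R, cut() can build the volume factor from the break positions and ave() can restart the counter within each volume. The twelve rows and the two start indices are hypothetical values chosen to keep the example small.

```r
n_chunks <- 12
sentiment_df <- data.frame(linenumber = seq_len(n_chunks))

# Hypothetical chunk indices where Volumes II and III begin
volume2_start <- 5
volume3_start <- 9

# right = FALSE makes the intervals [1, 5), [5, 9), [9, 13),
# so each start index falls in the volume it begins
sentiment_df$volume <- cut(
  sentiment_df$linenumber,
  breaks = c(1, volume2_start, volume3_start, n_chunks + 1),
  labels = c("Volume I", "Volume II", "Volume III"),
  right = FALSE
)

# Restart the count at 1 inside each volume
sentiment_df$linenumber <- ave(sentiment_df$linenumber, sentiment_df$volume,
                               FUN = seq_along)
print(sentiment_df)
```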
Positive and Negative Sentiment
First let’s look at the overall positive vs. negative sentiment in the text of Pride and Prejudice before looking at more specific emotions.
Here, each chunk of text has a score for the positive sentiment and the negative sentiment; a given chunk of text could have high scores for both, low scores for both, or any combination thereof. I have made the sign of the negative sentiment negative for plotting purposes. Let’s make a dataframe of some important events in the novel to annotate the plots; I found the chapters for these events and matched them up to the correct volumes and line numbers.
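The sign flip and the annotation dataframe can be sketched in base R like this. All scores and linenumber values below are placeholders, not results from the novel; only the event names and their volumes come from the book itself.

```r
volumes <- c("Volume I", "Volume II", "Volume III")

# Placeholder scores: two chunks per volume
sentiment_df <- data.frame(
  volume = factor(rep(volumes, each = 2), levels = volumes),
  linenumber = rep(1:2, 3),
  positive = c(5, 3, 4, 1, 2, 6),
  negative = c(1, 2, 0, 6, 5, 1)
)

# Make negative sentiment negative so it plots below zero
sentiment_df$negative <- -sentiment_df$negative

# Long format: one row per chunk and sentiment type
id_cols <- sentiment_df[c("volume", "linenumber")]
long_df <- rbind(
  data.frame(id_cols, sentiment = "positive", value = sentiment_df$positive),
  data.frame(id_cols, sentiment = "negative", value = sentiment_df$negative)
)

# Events to annotate; the linenumber values are placeholders
annotations <- data.frame(
  volume = factor(c("Volume II", "Volume III"), levels = volumes),
  linenumber = c(2, 1),
  label = c("Mr. Darcy proposes", "Lydia elopes"),
  stringsAsFactors = FALSE
)
print(long_df)
```

Giving the annotation dataframe the same volume levels as the scores keeps the two aligned if they are later plotted panel by panel.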
Now let’s plot the positive and negative sentiment.
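For readers following along without a plotting package, here is a rough sketch of this kind of plot in base R graphics, with one panel per volume. It is a stand-in for illustration, not the plotting code from this post, and the scores are placeholder numbers.

```r
volumes <- c("Volume I", "Volume II", "Volume III")

# Placeholder scores: four chunks per volume, negative already sign-flipped
sentiment_df <- data.frame(
  volume = factor(rep(volumes, each = 4), levels = volumes),
  linenumber = rep(1:4, 3),
  positive = c(3, 5, 4, 6, 2, 4, 1, 3, 5, 2, 4, 6),
  negative = -c(1, 0, 2, 1, 1, 2, 5, 1, 1, 6, 2, 0)
)

# Shared y-axis so the three panels are comparable
y_range <- range(sentiment_df$positive, sentiment_df$negative)

old_par <- par(mfrow = c(1, 3))  # one panel per volume, side by side
for (vol in levels(sentiment_df$volume)) {
  panel <- sentiment_df[sentiment_df$volume == vol, ]
  plot(panel$linenumber, panel$positive, ylim = y_range,
       pch = 16, col = "steelblue",
       xlab = "Narrative time (chunk)", ylab = "Sentiment", main = vol)
  points(panel$linenumber, panel$negative, pch = 16, col = "firebrick")
  abline(h = 0, lty = 2)  # zero line separating positive from negative
}
par(old_par)  # restore the previous layout
```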
Narrative time runs along the x-axis. Volume II is the shortest of the three parts of the novel. We can see here that the positive sentiment scores are overall much higher than the negative sentiment, which makes sense for Jane Austen’s writing style. We can see some more strongly negative sentiment when Mr. Darcy proposes for the first time and when Lydia elopes. Let’s try visualizing these same data with a bar chart instead of points.