You Must Allow Me To Tell You How Ardently I Admire and Love Natural Language Processing You Must Allow Me To Tell You How Ardently I Admire and Love Natural Language Processing
It is a truth universally acknowledged that sentiment analysis is super fun, and Pride and Prejudice is probably my very favorite book in all... You Must Allow Me To Tell You How Ardently I Admire and Love Natural Language Processing

It is a truth universally acknowledged that sentiment analysis is super fun, and Pride and Prejudice is probably my very favorite book in all of literature, so let’s do some Jane Austen natural language processing.

Project Gutenberg makes e-texts available for many, many books, including Pride and Prejudice which is available here. I am using the plain text UTF-8 file available at that link for this analysis. Let’s read the file and get it ready for analysis.

Munge the Data, But ELEGANTLY, As Would Befit Jane Austen

The plain text file has lines that are just over 70 characters long. We can read them in using the readr library, which is super fast and easy to use. Let’s use the skip and n_max options to leave out the Project Gutenberg header and footer information and just get the actual text of the novel. Lines of 70 characters are not really a big enough chunk of text to be useful for my purposes here (that’s not even a tweet!) so let’s use stringr to concatenate these lines in chunks of 10. That gives us sort of paragraph-sized chunks of text.

library(readr)
library(stringr)
rawPandP <- read_lines("./pg1342.txt", skip = 30, n_max = 13032)
PandP <- character()
for (i in seq_along(rawPandP)) {
        if (i%%10 == 1) PandP[ceiling(i/10)] <- str_c(rawPandP[i], 
                                                     rawPandP[i+1],
                                                     rawPandP[i+2],
                                                     rawPandP[i+3],
                                                     rawPandP[i+4],
                                                     rawPandP[i+5],
                                                     rawPandP[i+6],
                                                     rawPandP[i+7],
                                                     rawPandP[i+8],
                                                     rawPandP[i+9], sep = " ")
}

Maybe you don’t think for loops are elegant, actually, but I could not come up with a way to vectorize this.

Mr. Darcy Delivered His Sentiments in a Manner Little Suited to Recommend Them

To do the sentiment analysis, let’s use the NRC Word-Emotion Association Lexicon of Saif Mohammad and Peter Turney. You can read a bit more about the NRC sentiment dictionary and how it is used in one of my previous blog posts. It is implemented in R in the syuzhet package.

I was not sure, when I stopped to think about it, exactly how appropriate this tool is for analyzing 200-year-old text. Language changes over time and from what I can tell, the NRC lexicon is designed and validated to measure the sentiment in contemporary English. It was created via crowdsourcing on Amazon’s Mechanical Turk. However, it doesn’t seem to do badly on Jane Austen’s prose; the sentiment results are about what one would expect compared to a human reading of the meaning. If anything, the text in Pride and Prejudice involves more dramatic vocabulary than a lot of contemporary English prose and it is easier for a tool like the NRC dictionary to pick up on the emotions involved.

Let’s look at some examples.

library(syuzhet)
get_nrc_sentiment("Nobody can tell what I suffer! But it is always so. Those who do not complain are never pitied.")
##   anger anticipation disgust fear joy sadness surprise trust negative positive
## 1     1            0       0    0   0       1        0     0        2        0

Oh, Mrs. Bennett…

Pride And Prejudice GIF - Find & Share on GIPHY

 

via GIPHY

get_nrc_sentiment("And your defect is to hate everybody.")
##   anger anticipation disgust fear joy sadness surprise trust negative positive
## 1     2            0       1    1   0       1        0     0        2        0

 
Colin Firth Lizzy Your Sass Is Showing GIF - Find & Share on GIPHY

 

via GIPHY

get_nrc_sentiment("You must allow me to tell you how ardently I admire and love you.")
##   anger anticipation disgust fear joy sadness surprise trust negative positive
## 1     0            0       0    0   1       0        0     1        0        2

 

 

via GIPHY

So let’s start from a working hypothesis that the NRC lexicon can be applied to this novel and do the sentiment analysis for each chunk of text in our dataframe. At the same time, let’s make a linenumber that counts up through the novel.

PandPnrc <- cbind(linenumber = seq_along(PandP), get_nrc_sentiment(PandP))

Dividing Up the Volumes

Pride and Prejudice contains 61 chapters divided into three volumes; Volume I is Chapters 1-23, Volume II is Chapters 24-42, and Volume III is Chapters 43-61. Let’s find where these breaks between volumes have ended up.

grep("Chapter 1 ", PandP)
## [1] 1
grep("Chapter 24", PandP)
## [1] 451
grep("Chapter 43", PandP)
## [1] 805

Let’s make a volume factor for the dataframe and then restart the linenumber count at the beginning of each volume.

PandPnrc$volume <- "Volume I"
PandPnrc[grep("Chapter 24", PandP):length(PandP),'volume'] <- "Volume II"
PandPnrc[grep("Chapter 43", PandP):length(PandP),'volume'] <- "Volume III"
PandPnrc$volume <- as.factor(PandPnrc$volume)
PandPnrc$linenumber[PandPnrc$volume == "Volume II"] <- seq_along(PandP)
PandPnrc$linenumber[PandPnrc$volume == "Volume III"] <- seq_along(PandP)

Positive and Negative Sentiment

First let’s look at the overall postive vs. negative sentiment in the text of Pride and Prejudice before looking at more specific emotions.

library(dplyr)
library(reshape2)
PandPnrc$negative <- -PandPnrc$negative
posneg <- PandPnrc %>% select(linenumber, volume, positive, negative) %>% 
        melt(id = c("linenumber", "volume"))
names(posneg) <- c("linenumber", "volume", "sentiment", "value")

Here, each chunk of text has a score for the positive sentiment and the negative sentiment; a given chunk of text could have high scores for both, low scores for both, or any combination thereof. I have made the sign of the negative sentiment negative for plotting purposes. Let’s make a dataframe of some important events in the novel to annotate the plots; I found the chapters for these events and matched them up to the correct volumes and line numbers.

annotatetext <- data.frame(x = c(114, 211, 307, 183, 91, 415), y = rep(18.3, 6), 
                           label = c("Jane's illness", "Mr. Collins arrives", 
                                     "Ball at Netherfield", "Mr. Darcy proposes", 
                                     "Lydia elopes", "Mr. Darcy proposes again"),
                           volume = factor(c("Volume I", "Volume I", 
                                             "Volume I", "Volume II", 
                                             "Volume III", "Volume III"),
                                           levels = c("Volume I", "Volume II", "Volume III")))
annotatearrow <- data.frame(x = c(114, 211, 307, 183, 91, 415), 
                            y1 = rep(17, 6), y2 = c(11.2, 10.7, 11.4, 13.5, 10.5, 11.5), 
                            volume = factor(c("Volume I", "Volume I", 
                                              "Volume I", "Volume II", 
                                              "Volume III", "Volume III"),
                                            levels = c("Volume I", "Volume II", "Volume III")))

Now let’s plot the positive and negative sentiment.

library(ggplot2)
library(ggthemes)
ggplot(data = posneg, aes(x = linenumber, y = value, color = sentiment)) +
        facet_wrap(~volume, nrow = 3) +
        geom_point(size = 4, alpha = 0.5) + theme_minimal() +
        ylab("Sentiment") + 
        ggtitle(expression(paste("Positive and Negative Sentiment in ", 
                                 italic("Pride and Prejudice")))) +
        theme(legend.title=element_blank()) + 
        theme(axis.title.x=element_blank()) +
        theme(axis.ticks.x=element_blank()) +
        theme(axis.text.x=element_blank()) +
        geom_text(data = annotatetext, aes(x,y,label=label), hjust = 0.5, 
                  size = 3, inherit.aes = FALSE) +
        geom_segment(data = annotatearrow, aes(x = x, y = y1, xend = x, yend = y2),
                     arrow = arrow(length = unit(0.05, "npc")), inherit.aes = FALSE) +
        theme(legend.justification=c(1,1), legend.position=c(1, 0.71)) +
        scale_color_manual(values = c("aquamarine3", "midnightblue"))

center

Narrative time runs along the x-axis. Volume II is the shortest of the three parts of the novel. We can see here that the positive sentiment scores are overall much higher than the negative sentiment, which makes sense for Jane Austen’s writing style. We can see some more strongly negative sentiment when Mr. Darcy proposes for the first time and when Lydia elopes. Let’s try visualizing these same data with a bar chart instead of points.

ggplot(data = posneg, aes(x = linenumber, y = value, color = sentiment, fill = sentiment)) +
facet_wrap(~volume, nrow = 3) +
geom_bar(stat = "identity", position = "dodge") + theme_minimal() +
ylab("Sentiment") +
ggtitle(expression(paste("Positive and Negative Sentiment in ",
italic("Pride and Prejudice")))) +
theme(legend.title=element_blank()) +
theme(axis.title.x=element_blank()) +
theme(axis.ticks.x=element_blank()) +
theme(axis.text.x=element_blank()) +
theme(legend.justification=c(1,1), legend.position=c(1, 0.71)) +
geom_text(data = annotatetext, aes(x,y,label=label), hjust = 0.5,
size = 3, inherit.aes = FALSE) +
geom_segment(data = annotatearrow, aes(x = x, y = y1, xend = x, yend = y2),
arrow = arrow(length = unit(0.05, "npc")), inherit.aes = FALSE) +
scale_fill_manual(values = c("aquamari

Julia Silge

Julia Silge

My background in the physical sciences and programming has given me the tools to apply sophisticated analytical techniques to complicated problems. I am a data scientist and analyst with an understanding of mathematics and statistical models. Analyzing, understanding, and communicating about data makes me happy and I am passionate about finding insights in data and building data products to meet the needs of an organization. I come from a background in physics and astronomy and have worked in academia and ed tech before moving into data science. My experience in the physical sciences and education has given me a solid foundation for using data to answer interesting questions, and then communicating those findings to decision makers. I work effectively in both independent and collaborative environments, I learn new skills and subjects quickly, and I have proven writing and speaking abilities.