Editor’s note: Opinions expressed in this post do not necessarily reflect the views of #ODSC, nor do they necessarily reflect the views of Edward’s employer.


Looks like the presidential race is heating up! Many polls have the candidates in a tight race. Just after the debate both Clinton and Trump proclaimed “victory.” Instead of sound bites, in this post we will calculate two overall text statistics. First, I will demonstrate a way to calculate the grade level of the Clinton and Trump campaign speeches. Next, we will explore the positive and negative language in these speeches to calculate a metric called polarity. Along the way I will show you how to make some quick visuals.

The text in this post was gathered from YouTube closed caption files.  If you missed how to collect this data check out the first post in this series here.  If you want to follow along with this code get the 20 Clinton and Trump speeches.

No matter your party affiliation I hope you find the text mining approaches in the series to be informative.  I wouldn’t draw too many conclusions from this small sample set…I mean to only educate on some cool uses of R!

Organizing Multiple Speeches

Often when I am doing a text mining project I have a lot of individual files.  Dealing with large separate files requires knowing how to import them efficiently.  Additionally I hate rewriting code and so I make use of custom functions applied to lists. Using functions helps to standardize manipulations and reduces mistakes.

First load your libraries. Quanteda is used for quantitative analysis on text. The package has a good wrapper for calculating readability measures including reading grade level. The qdap package is another quantitative discourse library. The package has an easy polarity scoring function. Next, the data.table package efficiently loads and manipulates data. Using data.table during a text mining project with multiple files makes the analysis faster and easier. The rvest package is usually used for easy web scraping. However, when I work with nested lists I like its pluck function to easily extract list elements. The pbapply package is one of my favorite packages, but it is completely optional. It should be way more popular! The “pb” stands for progress bar and the package prints a progress bar when you use apply, lapply and sapply. The last two libraries, ggplot2 and ggthemes, are used for constructing visuals.

library(quanteda)

library(qdap)

library(data.table)

library(rvest)

library(pbapply)

library(ggplot2)

library(ggthemes)

After loading libraries, let’s read and organize each candidate’s speeches using a custom function. Within the function, list.files searches the working directory and returns a character vector of file names matching the search pattern, which is defined by the candidate input. Each file in the candidate.files vector is passed to fread, which is a fast file reading function. At this point candidate.speeches is a list. The elements of the list are given names based on the original candidate.files vector. This will help you identify things later.

speech.read<-function(candidate){

  candidate.files<-list.files(pattern=candidate)

  candidate.speeches<-pblapply(candidate.files,fread)

  names(candidate.speeches)<-candidate.files

  return(candidate.speeches)

}
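To see the same pattern without any extra packages, here is a minimal sketch that swaps in base R’s lapply and read.csv for pblapply and fread. The speech.read.base name, the dir argument and the two demo files are made up for illustration; they are not part of the original code.

```r
# Base R variant of speech.read: lapply/read.csv stand in for pblapply/fread.
# The 'dir' argument is an added assumption so the demo can use a temp folder.
speech.read.base <- function(candidate, dir = ".") {
  candidate.files <- list.files(path = dir, pattern = candidate)
  candidate.speeches <- lapply(file.path(dir, candidate.files),
                               read.csv, stringsAsFactors = FALSE)
  names(candidate.speeches) <- candidate.files
  candidate.speeches
}

# Write two tiny made-up caption files to a temporary folder
demo.dir <- tempfile("speeches")
dir.create(demo.dir)
write.csv(data.frame(start = c(0.5, 3.2), word = c("hello everyone", "thank you")),
          file.path(demo.dir, "Clinton_demo1.csv"), row.names = FALSE)
write.csv(data.frame(start = c(1.1, 4.0), word = c("good evening", "we can do this")),
          file.path(demo.dir, "Clinton_demo2.csv"), row.names = FALSE)

# Only files whose names contain "Clinton" are read
demo <- speech.read.base("Clinton", demo.dir)
names(demo)
```

Because the list elements carry the file names, you can always trace a result back to the speech it came from.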

Armed with this function you can quickly load each candidate’s speeches. This type of function keeps you from writing a separate read.csv call for every file. For each candidate pass in the search term…just be sure the candidate name is somewhere in the file name!

trump<-speech.read('Trump')

clinton<-speech.read('Clinton')

The trump and clinton lists each contain 10 data frames with two columns. The “start” column contains the time in seconds at which a statement was made and the second column contains the words. For simplicity I use pluck to extract the column called “word.” The trump.words and clinton.words objects are now just lists of individual character vectors. Throughout this post we will use the trump, clinton, trump.words and clinton.words objects for our analysis.

trump.words<-pluck(trump,'word')

clinton.words<-pluck(clinton,'word')
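The pluck used here came from older rvest releases. Newer versions may no longer export it, and the purrr function of the same name behaves differently. As a hedged alternative, the same extraction can be sketched in base R with lapply; the toy.speeches list below is made-up data standing in for the real speech list.

```r
# Made-up stand-in for the list of speech data frames
toy.speeches <- list(
  speech1.csv = data.frame(start = c(1.2, 4.8),
                           word = c("good evening", "thank you"),
                           stringsAsFactors = FALSE),
  speech2.csv = data.frame(start = c(0.7, 3.1),
                           word = c("hello ohio", "we will win"),
                           stringsAsFactors = FALSE)
)

# Pull the "word" column out of every data frame in the list
toy.words <- lapply(toy.speeches, `[[`, "word")
str(toy.words)
```

The result is a named list of character vectors, the same shape as trump.words and clinton.words.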

Readability – What grade level was that speech?

A text’s readability score measures its syntactic complexity, vocabulary and other factors. There are many readability measures such as Flesch-Kincaid or Spache, but this analysis uses “Forcast.RGL.” Forcast.RGL is easy to understand and is good for learning. Forcast was originally used for military manuals after research on Vietnam draftees. The “RGL” stands for reading grade level. I should note that Forcast was designed for written text, while these texts are transcripts of spoken word, so the measure could be biased.

To calculate Forcast readability grade level:

  1. Sample 150 words from the text
  2. Count the number of single syllable words in the sample.  This is “N”
  3. Divide “N” by 10
  4. Subtract #3’s output from 20.  
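The four steps above boil down to one line of arithmetic. Here is a small sketch; the forcast.rgl name and the example counts are made up for illustration, and the count is assumed to come from a 150-word sample.

```r
# Forcast reading grade level from N, the number of single-syllable
# words counted in a 150-word sample
forcast.rgl <- function(n.single.syllable) {
  20 - (n.single.syllable / 10)
}

# A sample with 98 one-syllable words scores about grade 10.2
forcast.rgl(98)

# The function is vectorized, so several samples can be scored at once
forcast.rgl(c(98, 105, 120))
```

More one-syllable words means simpler vocabulary and so a lower grade level.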

Again a custom function will make the code more efficient. The speech.readability function accepts one speech as a character vector of words. First it collapses the words into a single string. Next the quanteda corpus function changes the text’s class. Lastly, the readability function is applied along with the specific measure, FORCAST.RGL.

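speech.readability<-function(speech){

  # collapse with a space so words from adjacent caption lines do not run together

  speech<-paste(speech,collapse=' ')

  speech<-corpus(speech)

  # newer quanteda releases may expose this as textstat_readability() in quanteda.textstats

  grade<-readability(speech,"FORCAST.RGL")

  return(grade)

}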

Now I use pblapply along with the candidate words and the speech.readability function.  The result is nested in unlist so each speech has a single grade level.

trump.grades<-unlist(pblapply(trump.words,speech.readability))

clinton.grades<-unlist(pblapply(clinton.words,speech.readability))

The results are easy to examine using base functions like min, max, or range. With this small sample set the minimum score among Trump’s speeches is 10.17, while Clinton’s minimum is 10.23.

min(trump.grades)

min(clinton.grades)

I am a visual learner so I like to create simple plots to understand my data. This code creates a barplot of Trump grades in “Republican red.” After creating the bar plot, the speech names are added as labels on the bars using text. The labels are shortened by 11 characters using substr to make the bars less cluttered.

grading<-barplot(trump.grades, col='#E91D0E', xaxt="n",main='Trump Speech Reading Grade Level')

text(grading, 5.5, labels= substr(names(trump.grades),1,nchar(names(trump.grades))-11), srt=90, col='white')

[Figure: Trump’s Forcast reading grade level is pretty consistent.]

Polarity – Grading positive and negative tone in speeches.

It turns out both candidates are fairly similar in grade level when talking to supporters.  Now let’s turn to sentiment analysis and more specifically polarity.  Polarity scores text based on identifying positive and negative words in a “subjectivity lexicon.”  Here qdap supplies a pre-constructed list of positive and negative words from the University of Illinois-Chicago.  The polarity function tags words in the subjectivity list and applies the following steps.

  1. Tag positive and negative words from the key.pol subjectivity lexicon.  In more robust analyses you should customize these words.
  2. Once a tagged word is identified the four preceding and two following terms are “clustered.”
  3. Within the cluster, positive words are valued at 1 and negative words at -1. Neutral words have a value of zero. The remaining words are checked for “valence shifters.” An example valence shifter is “very” as in “very good” which amplifies the positive intent. Amplifying and de-amplifying valence shifters add or subtract 0.8 from the polarity score.
  4. The sum of positive and negative words along with positive and negative valence shifters is saved for each cluster.
  5. The sum is divided by the square root of the number of words in the passage. This helps measure the density of the keywords.
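To make the steps above concrete, here is a deliberately simplified sketch of the scoring idea. It is not qdap’s implementation: the three tiny word lists are made-up stand-ins for the real lexicon, and it handles only amplifiers, ignoring negators and de-amplifiers.

```r
# Made-up mini lexicons; qdap's real lists hold thousands of words
positive.words <- c("good", "great")
negative.words <- c("bad", "terrible")
amplifiers     <- c("very", "really")

toy.polarity <- function(sentence, amp.weight = 0.8) {
  words <- strsplit(tolower(sentence), "\\s+")[[1]]
  n <- length(words)
  total <- 0
  for (i in seq_len(n)) {
    # Step 1: tag the word as positive (1), negative (-1) or neutral (0)
    base <- 0
    if (words[i] %in% positive.words) base <- 1
    if (words[i] %in% negative.words) base <- -1
    if (base == 0) next
    # Step 2: cluster of four words before and two words after the tagged word
    neighbors <- setdiff(seq(max(1, i - 4), min(n, i + 2)), i)
    # Step 3: each amplifier in the cluster adds 0.8 to the word's weight
    n.amplifiers <- sum(words[neighbors] %in% amplifiers)
    # Step 4: accumulate the cluster scores
    total <- total + base * (1 + amp.weight * n.amplifiers)
  }
  # Step 5: divide by the square root of the word count
  total / sqrt(n)
}

toy.polarity("the plan is very good")  # 1.8 / sqrt(5), about 0.80
toy.polarity("that was bad")           # -1 / sqrt(3), about -0.58
```

Dividing by the square root of the word count is why a long, mostly neutral speech scores close to zero even if it contains a few strong words.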

The custom function, speech.polarity, will take some time to compute. Tagging words from among thousands in a subjectivity lexicon is computationally intensive. Thus, the pblapply function is helpful. In this function the speech words are passed to qdap’s polarity. Then pluck is applied to the result so only speech level polarity measures are captured instead of individual word statistics. The list of data frames is then organized into a single data frame using do.call and rbind. The do.call and rbind combination is very helpful when dealing with lists.

speech.polarity<-function(speech){

  speech.pol<-pblapply(speech,polarity)

  speech.pol<-pluck(speech.pol,'group')

  speech.pol<-do.call(rbind,speech.pol)

  return(speech.pol)

}

It turns out one of the positive words in the basic subjectivity lexicon is “trump.” Obviously Clinton uses Trump’s name a lot so it must be removed prior to scoring. I decided to remove “trump” in the speeches rather than adjust the subjectivity lexicon. Rapply recursively applies a function to list elements. Each list element, “x,” is passed to the gsub function. The gsub function is a “global substitution” that replaces “trump” with an empty string.

clinton.wo.tr<- rapply(clinton.words, function(x) gsub("trump", "", x), how = "replace")

trump.wo.tr<- rapply(trump.words, function(x) gsub("trump", "", x), how = "replace")
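If rapply is new to you, this small sketch shows what the substitution does to a list of character vectors. The toy.words list is made-up data, not text from the actual speeches.

```r
# Made-up stand-in for a list of speech word vectors
toy.words <- list(
  speech1 = c("trump said no", "vote today"),
  speech2 = c("thank you", "beat trump")
)

# how = "replace" keeps the list structure and swaps in the cleaned vectors
toy.clean <- rapply(toy.words, function(x) gsub("trump", "", x), how = "replace")
toy.clean
```

Note that gsub is case sensitive by default, so this only removes the lowercase form; the surrounding spaces are left in place.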

Now you can apply the speech.polarity function to each candidate’s speech words. The result is a simple data frame with 10 rows corresponding to specific speeches. Again base functions like range can help you compare results. The 10 Trump speeches have a wider range than Clinton’s, -0.024 to 0.099. Clinton’s range was 0.005 to 0.086.

clinton.pol<-speech.polarity(clinton.wo.tr)

trump.pol<-speech.polarity(trump.wo.tr)

range(trump.pol$ave.polarity)

range(clinton.pol$ave.polarity)

A simple barplot can be constructed to learn about the data.  This code plots Clinton’s average speech polarity in “Democrat blue.”  The text function adds the rounded polarity values in white at the top.  Looking at this barplot Clinton is very positive except at her Reno speech!

clinton.bars<-barplot(clinton.pol$ave.polarity, col='#232066', main='Clinton Speech Polarity')

text(x = clinton.bars, y = clinton.pol$ave.polarity, label = round(clinton.pol$ave.polarity,3), pos = 1, cex = 0.8, col = "white")

[Figure: During Clinton’s August 25th speech her tone was more negative than usual.]

Next I wanted to look at the speeches as a timeline of positive and negative words. As a time series I hope to understand candidate styles. I decided to create another function that takes a speech and a title so I can quickly make 20 plots. Instead of the polarity result’s group element, now the all element is selected. Then ggplot is used to create a time series. I use a smoothed line to make the graph more appealing along with theme_gdocs. I added a horizontal red line at 0 to highlight when a candidate’s language becomes negative.

ind.speech.pol <-function(speech, speech.title){

  speech.pol <-polarity(speech$word)$all

  plot.speech <- ggplot(speech.pol, aes(seq_len(nrow(speech.pol)), polarity))

  plot.speech <- plot.speech+ theme_gdocs()+ stat_smooth()+ ggtitle(speech.title)

  plot.speech <- plot.speech+geom_hline(aes(yintercept=0), col='darkred', size=2)

  return(plot.speech)

}

This function needs to be applied to each candidate’s speech corpus. Another of my favorite functions is Map. The Map function applies a function element by element across one or more lists or vectors. In this example the ind.speech.pol function is applied to each speech in trump, paired with the matching speech name, which becomes the plot title.

Map(ind.speech.pol,trump,names(trump))

Map(ind.speech.pol,clinton,names(clinton))
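Here is a minimal sketch of how Map pairs each list element with its name. The describe.speech function and the toy.speeches list are made up for illustration and stand in for ind.speech.pol and the real speech lists.

```r
# Made-up stand-in for a list of speeches
toy.speeches <- list(
  reno.csv = c("line one", "line two", "line three"),
  ohio.csv = c("line one", "line two")
)

# A two-argument function, like ind.speech.pol(speech, speech.title)
describe.speech <- function(speech, speech.title) {
  paste(speech.title, "has", length(speech), "lines")
}

# Map walks both inputs in parallel: first speech with first name, and so on
toy.summaries <- Map(describe.speech, toy.speeches, names(toy.speeches))
toy.summaries
```

Map always returns a list, which is why the real call above returns a list of ggplot objects, one per speech.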

 

With just two lines of code, you get 20 plots!  Below is Clinton’s Reno speech.

[Figure: Clinton’s most negative speech had a portion of negative language but ended on a very positive tone.]

Lastly I am interested in specific terms used by candidates. The custom search.plot function makes it easy to quickly generate a visual identifying word usage within a speech timeline. The function accepts the speech list, the term I want to find and then a color for the dots. Using mapply with cbind, each speech’s data frame gets a new column. The new column, speech, works because R recycles the name of the list element for all rows. In this way you can organize all the text but retain the original source as a categorical variable. I also use substr to shorten the speech factors, making the plot less cluttered. Next temp.df is changed from a list of data frames into a single data frame, and grep keeps only the rows that match the search term. The p object represents the plot output from the function. In it, ggplot references temp.df.

search.plot<-function(speech.list,search.pattern, dot.col){

  temp.df <- mapply(cbind, speech.list,

                    "speech"=as.factor(

                      substr(names(speech.list), 1, nchar(names(speech.list))-11)),

                    SIMPLIFY=FALSE)

  temp.df <-do.call(rbind,temp.df)

  temp.df <-temp.df[grep(search.pattern,temp.df$word,ignore.case = T),]

  p <-ggplot(temp.df, aes(x=start, y=speech))

  p <-p+geom_point(stat="identity", col=dot.col)+theme_gdocs()

  p <-p+ggtitle(paste('Mentions of',search.pattern))

  return(p)

}
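The data preparation inside search.plot can be tried on its own without plotting. This is a small sketch on made-up data; toy.speeches and its contents are hypothetical and the names are not shortened with substr here.

```r
# Made-up stand-in for the list of speech data frames
toy.speeches <- list(
  reno.csv = data.frame(start = c(1, 5),
                        word = c("hello reno", "trump said"),
                        stringsAsFactors = FALSE),
  ohio.csv = data.frame(start = c(2, 9),
                        word = c("hello ohio", "thank you"),
                        stringsAsFactors = FALSE)
)

# Add a "speech" column to each data frame; the single name is recycled down the rows
toy.df <- mapply(cbind, toy.speeches, "speech" = names(toy.speeches), SIMPLIFY = FALSE)

# Stack the list into one data frame, then keep only rows matching the search term
toy.df <- do.call(rbind, toy.df)
toy.hits <- toy.df[grep("trump", toy.df$word, ignore.case = TRUE), ]
toy.hits
```

Each surviving row keeps its start time and source speech, which are exactly the x and y values the plot needs.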

I searched for Clinton’s mention of Trump but you can use your own terms or switch candidates.  Not surprisingly Clinton’s most negative speech has the most mentions of Trump!

search.plot(clinton,'trump','#232066')

[Figure: Clinton’s most negative speech mentions Trump a lot!]


Conclusion

I hope you enjoyed this second post about candidate speeches. I think readability and polarity are interesting text KPIs that you should explore when doing your own text mining project. Plus, the post shows how creating custom functions helps you write more concise code.

Be on the lookout for my next post where we tackle Topic Modeling to learn more about Clinton and Trump.


©ODSC 2016

Edward Kwartler

Working at Liberty Mutual I shape the organization's strategy and vision concerning next generation vehicles. This includes today's advanced vehicle ADAS features and the self driving cars of the (near) future. I get to work with exciting startups, MIT labs, government officials, automotive leaders and various data scientists to understand how Liberty can thrive in this rapidly changing environment. Plus I get to internally incubate ideas and foster an entrepreneurial ethos! Specialties: Data Science, Text Mining, IT service management, Process improvement and project management, business analytics
