Editor’s note: Opinions expressed in this post do not necessarily reflect the views of #ODSC, nor do they necessarily reflect the views of Edward’s employer.

Part 1 Obtaining Transcripts for Campaign Trail Speeches

The political season is long and arduous. As a former Ohioan, I dreaded every election year because it was punctuated with endless negative and inflammatory campaign ads. Now that I live in the Democratic stronghold of Massachusetts, I am relieved that my TV and radio are mostly free from such informationally barren commercials. Instead, I try to focus on the candidates' topics and statements outside of commercials. Besides, I always wonder who is swayed by these ads…but that is an analysis for another time!

As a data scientist with a passion for text mining, I am keenly interested in word choice. Over the course of the last year, Hillary Clinton and Donald Trump have spoken multiple times a day. These speeches provide an ample corpus for text mining. Of course, I am not alone in my interest in candidate word choice. The 24-hour news cycle spends considerable time on individual pithy candidate comments or out-of-context quotes. For example, Hillary Clinton’s comment calling Trump supporters “deplorables” was covered incessantly for weeks and, of course, Donald Trump has done some name calling too. Surely name calling is not humanistic and probably not a good “look” for a candidate, but the issues of today go beyond such nonsense.

So I propose a quantitative and analytical approach, based on multiple speeches, to draw out the styles and topics of Hillary Clinton's and Donald Trump's speeches. The analysis should lead to a balanced understanding of the candidates, free from news anchor opinions. This blog series is broken up into four parts to illustrate common text mining techniques applied to Trump and Clinton speeches.

The blog sequence covers:

  1. Obtaining Transcripts for Candidate Speeches
  2. Organizing Speeches & Initial Metrics
  3. Topic Modeling Visualizations
  4. Comparing Trump & Clinton

Whether you are a Democrat or a Republican I hope you enjoy the series and learn something along the way.

Finding Reliable Text

Surprisingly, it was difficult to get full transcripts of the stump speeches.  I suspect the average American relies on news articles with commentary, live feeds and social media.  Further, I didn’t want to rely on Liberal or Conservative websites or manual transcriptions that could be biased.  

So I settled on YouTube’s closed captioning data from actual Clinton and Trump speeches.   I have to assume Google’s transcribing software is not politically motivated so errors are unbiased.  After some developer sleuthing I found the caption data is in an XML file.

To gather Trump and Clinton speeches, I selected the Right Side Broadcasting YouTube Channel.  The channel uploads Trump rally speeches from around the country with consistent titles including states and dates.  Later I found another YouTube channel RBC Broadcasting that covers both candidate speeches.  The blog series uses speeches from both channels.    


The Right Side Broadcasting YouTube Channel page.

Follow these steps to get a single video’s captions. Using Chrome, navigate to a video link such as https://www.youtube.com/watch?v=uXiJ8gudUwo. Once there, right-click anywhere on the page and select “Inspect.”

This will open up the Chrome developer console alongside the video.  First click “Network” to change from the HTML information.  Next type “timed” into the filter box.  Lastly, click on the “cc” icon to enable closed captioning.


The steps to identify a speech’s closed caption information.

If the video offers closed captioning, the developer panel will display a file starting with “timedtext” as shown below. Hover over the file name and right-click to “open link in a new tab.” Once opened, you can see the XML contains each word along with second-by-second timing information. The URLs expire, so be sure to parse them immediately.


Quick & Easy but Messy

Parsing this XML file can be done easily in R. First load the xml2 package. The package provides common sense ways to extract information from extensible markup language (XML) files.

library(xml2)


Next create a character string representing the URL of the speech captions.  In this example, change the text in between the quotes to your URL from the developer console.  This will construct the object url.

url<-'<Enter URL Address from developer console>' 

The next code snippet extracts only the text from the XML. First, page is created by passing url to read_xml. This instructs R to retrieve and parse the XML document from the Internet. Once retrieved, page holds the parsed document, including its head and body information. This is passed to xml_text to extract only the speech words.

page <- read_xml(url)

speech <- xml_text(page)

The speech object now contains all caption words from the video. However, the characters for a line break (\n) are scattered throughout the text. When a computer parses text, “\n” represents the End of Line (EOL) and denotes that another line is starting. Since this was never said by the candidate, it should be removed.

Taking out the EOL markers is easy with gsub. The “global substitution” function is first passed a string pattern to search for. Doubling the backslash before the “n” escapes the backslash itself, so gsub receives the regular expression \n and correctly identifies the line break. The next parameter is the replacement. Here all line breaks are replaced by a single space between quotes. Lastly, specify the speech object to search.

speech <- gsub('\\n', ' ', speech)
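To try the quick approach without a live caption URL, here is a minimal sketch that runs the same steps on a small inline XML string. The sample_xml content is a hypothetical stand-in for a real timedtext file, whose layout may differ; only read_xml, xml_text and gsub come from the steps above.

```r
library(xml2)

# Hypothetical stand-in for a YouTube "timedtext" response (an assumption,
# not real caption data). '\n' inside the string is an actual line break.
sample_xml <- paste0(
  '<timedtext><body>',
  '<p t="125000" d="6290">make America\ngreat again</p>',
  '<p t="131290" d="2000">\nthank you</p>',
  '</body></timedtext>'
)

page   <- read_xml(sample_xml)      # read_xml also accepts a URL string
speech <- xml_text(page)            # every text node, line breaks included
speech <- gsub('\\n', ' ', speech)  # regex \n matches the newline character

print(speech)
```

With a real video, replace sample_xml with the url object created earlier.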

Simple enough!  Once you have the URLs you only need 4 lines of code to extract a speech from YouTube.  If you followed along with the same video the last ten words of Donald Trump’s speech should read:

“…tomorrow   yeah yeah   yeah yeah yeah   yeah   America great again.”

It turns out the end of this video contains non-Trump words. Google’s transcription service incorrectly identifies “yeah” when the crowd is chanting “U.S.A.” Additionally, the broadcaster starts speaking at the end of the video. If you are in a hurry, you can use gsub to remove the “yeah” or substitute “U.S.A.” for it. Depending on time constraints, you can also read the transcript manually and trim the text that falls outside Trump's remarks.
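As a rough illustration of the gsub cleanup just described, the sketch below works on a made-up fragment in the style of the ending quoted above. The fragment and the object names usa.version and drop.version are assumptions for the example.

```r
# Hypothetical transcript fragment (not the actual caption text).
speech <- "see you tomorrow yeah yeah yeah America great again"

# \\b marks a word boundary, so only the standalone word "yeah" is matched.
usa.version <- gsub("\\byeah\\b", "U.S.A.", speech)

# Alternatively drop the word, then collapse the leftover spaces.
drop.version <- gsub("\\byeah\\b", "", speech)
drop.version <- trimws(gsub(" +", " ", drop.version))

print(usa.version)
print(drop.version)
```

Which version is preferable depends on whether crowd chants should count toward later word frequencies.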

More Effort but More Information

For a lot of videos the quick approach is acceptable. However, I decided to extract time information too. To gather both words and start times, add another library called rvest, then create the same url and page objects.

library(xml2)

library(rvest)

url<-'<Enter URL Address from developer console>'

page <- read_xml(url)


When you review the XML file in your browser you should see <p> tags similar to those below.

<p t="125000" d="6290" w="1">
<s ac="251">make</s>
<s t="119" ac="238">America</s>
<s t="569" ac="240">great</s>
<s t="1319" ac="251">again</s>
</p>


The code below extracts each <p> node including the text and the t, d, and w values. The xml_nodes function gathers all nodes matching an XPath query. The double slashes followed by “p” are XPath for any node labeled p, wherever it sits in the document.
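The snippet described here does not appear in this copy of the post, so the following is only a sketch of the step. It uses xml2's xml_find_all rather than rvest's xml_nodes, which newer rvest releases appear to deprecate; the sample XML and the object name p.nodes are assumptions.

```r
library(xml2)

# Hypothetical caption XML modeled on the <p> example above.
sample_xml <- paste0(
  '<timedtext><body>',
  '<p t="125000" d="6290" w="1">',
  '<s ac="251">make</s><s t="119" ac="238"> America</s>',
  '<s t="569" ac="240"> great</s><s t="1319" ac="251"> again</s>',
  '</p>',
  '<p t="131290" d="2000" w="1"><s ac="240">yeah</s></p>',
  '</body></timedtext>'
)
page <- read_xml(sample_xml)

# '//p' is XPath for "every <p> node, at any depth of the document".
p.nodes <- xml_find_all(page, '//p')

print(length(p.nodes))    # number of caption lines found
print(xml_text(p.nodes))  # text of each <p>, its <s> children concatenated
```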


The xml_attrs function returns XML node attributes such as the “t” value. Using xml_attrs, the start.time object is a list containing all non-text XML information, one element per <p> node. The pluck function conveniently selects elements from within the list by name. This lets you extract just the “t” values. The pluck call is nested in the unlist function so that start.time becomes a single vector of “t” values.
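The original code for this step is not shown here, so this is a hedged sketch of the same idea. It uses base R's lapply in place of pluck to avoid an extra dependency, and also shows xml2's xml_attr, which returns one named attribute directly. The sample XML and p.nodes are assumptions.

```r
library(xml2)

# Hypothetical caption XML with two <p> nodes.
sample_xml <- paste0(
  '<timedtext><body>',
  '<p t="125000" d="6290" w="1">make America great again</p>',
  '<p t="131290" d="2000" w="1">yeah</p>',
  '</body></timedtext>'
)
p.nodes <- xml_find_all(read_xml(sample_xml), '//p')

# xml_attrs gives a list with one named character vector (t, d, w) per node.
attrs <- xml_attrs(p.nodes)

# Pull the "t" entry out of each element and flatten to a plain vector.
start.time <- unlist(lapply(attrs, function(a) a[['t']]))

# Shortcut: xml_attr fetches a single attribute for every node at once.
start.time.alt <- xml_attr(p.nodes, 't')

print(start.time)
```

Note that the values are still character strings at this point.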



But there is a problem! The start times are character strings rather than numbers, and they are expressed in milliseconds, so the values are very long. Let’s modify them to make more sense.

The first thing to do is convert the character values to numeric. Then divide each value by 1000. Now the “t” values are expressed in seconds.
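A small sketch of that conversion, using two made-up “t” values in place of the ones extracted from a real caption file:

```r
# Character "t" values as they come out of the XML (hypothetical examples).
start.time <- c("125000", "131290")

# Convert to numeric, then scale from milliseconds to seconds.
start.time <- as.numeric(start.time) / 1000

print(start.time)
```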


As before, you get the speech text with xml_text.  Then organize each word and the corresponding start time into a data frame.  
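The data frame code is not shown in this copy, so here is one way it might look. The name words.time and its word column match the grep call used later in the post; the start column name and the sample XML are assumptions. Calling xml_text on the <p> nodes, rather than on the whole page, gives one text value per row.

```r
library(xml2)

# Hypothetical caption XML with two <p> nodes.
sample_xml <- paste0(
  '<timedtext><body>',
  '<p t="125000" d="6290" w="1">make America great again</p>',
  '<p t="131290" d="2000" w="1">yeah</p>',
  '</body></timedtext>'
)
p.nodes <- xml_find_all(read_xml(sample_xml), '//p')

# One row per <p> node: its text and its start time in seconds.
words.time <- data.frame(
  word  = xml_text(p.nodes),
  start = as.numeric(xml_attr(p.nodes, 't')) / 1000,
  stringsAsFactors = FALSE
)

print(words.time)
```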



Additional cleanup is needed because crowd “USA” chants are captured as “yeah.” Plus, the line breaks still need to be removed. The grep function searches for a pattern and returns the row positions where it is found. Here the drops object is created by searching for “yeah” OR “\n.” The two patterns are separated by the pipe operator (|), which signifies “OR,” so grep returns a row number if either pattern is found. Those rows are then dropped from the original data frame by indexing with the negated row numbers (the minus sign).
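A sketch of this cleanup on a small hypothetical data frame follows. The drops name comes from the description above; the sample rows and the length check are additions for the example. The check matters because indexing with an empty set of negative row numbers returns zero rows instead of the untouched data frame.

```r
# Hypothetical rows standing in for the caption data frame.
words.time <- data.frame(
  word  = c("make America great again", "yeah", "\n", "thank you"),
  start = c(125, 131.29, 133, 135.5),
  stringsAsFactors = FALSE
)

# Row positions containing "yeah" OR a line break.
drops <- grep('yeah|\\n', words.time$word)

# Guard against the no-match case before dropping rows.
if (length(drops) > 0) {
  words.time <- words.time[-drops, ]
}

print(words.time)
```

An alternative that needs no guard is logical indexing with grepl, for example words.time[!grepl('yeah|\\n', words.time$word), ].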



Almost there!  Now search for “RSB” with grep.   The video producer “Right Side Broadcasting” mentions “RSB” so you should drop any rows after that because Trump is no longer speaking.  

rsb.comment<-grep("RSB", words.time$word)
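The line that actually drops the rows is not shown in this copy. One possible version, on hypothetical data, keeps only the rows before the first “RSB” mention. Dropping the “RSB” row itself is a choice made for this sketch.

```r
# Hypothetical rows; the last two stand in for the broadcaster's sign-off.
words.time <- data.frame(
  word  = c("make America great again", "thank you", "this is RSB", "stay tuned"),
  start = c(125, 135.5, 140, 142),
  stringsAsFactors = FALSE
)

rsb.comment <- grep("RSB", words.time$word)

# Keep rows 1 .. (first RSB row - 1); leave the data alone if RSB never appears.
if (length(rsb.comment) > 0) {
  words.time <- words.time[seq_len(rsb.comment[1] - 1), ]
}

print(words.time)
```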



This code gives you a reliable collection method with unbiased results when gathering Trump or Clinton speeches.  You can further automate the process using RSelenium but that may be against YouTube’s policies.  The following blog posts examine 20 speeches from both candidates but for now here is a quick taste of what’s to come.

In this speech, Trump’s positive and negative words are shown as a time series using the code below.


library(ggplot2)

library(ggthemes)  # provides theme_gdocs()

trump.speech <- ggplot(pol$all, aes(seq_along(polarity), polarity))

trump.speech + stat_smooth() + theme_gdocs() + ggtitle('Trump 9/15 NH')


©ODSC 2016

Edward Kwartler

Working at Liberty Mutual I shape the organization's strategy and vision concerning next generation vehicles. This includes today's advanced vehicle ADAS features and the self driving cars of the (near) future. I get to work with exciting startups, MIT labs, government officials, automotive leaders and various data scientists to understand how Liberty can thrive in this rapidly changing environment. Plus I get to internally incubate ideas and foster an entrepreneurial ethos! Specialties: Data Science, Text Mining, IT service management, Process improvement and project management, business analytics