Sentiment Analysis in R Made Simple

Sentiment analysis lies at the heart of natural language processing, text mining/analytics, and computational linguistics. It refers to any measurement technique that extracts subjective information from textual documents. In other words, it measures the polarity of the expressed sentiment on a scale ranging from negative to positive.
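To make the idea of polarity concrete, here is a minimal base R sketch of one simple dictionary-based score: the number of positive words minus the number of negative words, divided by the total number of words. The word lists and the polarity_score() helper are hypothetical and exist only for illustration; real dictionaries contain thousands of entries.

```r
# Hypothetical mini word lists; real dictionaries are far larger.
positive_words <- c("good", "great", "gain", "success")
negative_words <- c("bad", "lousy", "loss", "failure")

# Net polarity: (positive hits - negative hits) / number of tokens.
# Hypothetical helper, not part of any package.
polarity_score <- function(tokens) {
  n_pos <- sum(tokens %in% positive_words)
  n_neg <- sum(tokens %in% negative_words)
  (n_pos - n_neg) / length(tokens)
}

# Already tokenized and lowercased for simplicity.
tokens <- c("my", "visit", "today", "was", "really", "lousy")
polarity_score(tokens)  # one negative hit out of six tokens, so below zero
```

A score below zero indicates negative sentiment, a score above zero indicates positive sentiment, and the magnitude reflects how many opinionated words the text contains relative to its length.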

Before sentiment can be measured, the text must be converted into a machine-readable format. This is done using a number of preprocessing steps: tokenize the text into single words, convert it to lowercase, remove stop words and punctuation, and stem the words. The R package we’ll use in this article performs these operations automatically.
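Although the package handles preprocessing for you, it can help to see the steps spelled out. The following is a rough base R sketch under simplifying assumptions: the example sentence and the tiny stop-word list are made up, and stemming is left out because it needs a dedicated stemmer (for example, the wordStem() function from the SnowballC package).

```r
# Toy preprocessing pipeline in base R (illustrative only).
text <- "The merger was NOT welcomed by investors, and shares fell sharply."

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
stop_words <- c("the", "was", "by", "and", "a", "an", "of", "to")

lowered  <- tolower(text)                      # convert to lowercase
no_punct <- gsub("[[:punct:]]", "", lowered)   # strip punctuation
tokens   <- strsplit(no_punct, "\\s+")[[1]]    # tokenize on whitespace
tokens   <- tokens[!tokens %in% stop_words]    # drop stop words

print(tokens)
```

The result is a character vector of lowercase content words, which is the kind of input a dictionary lookup works on.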

Doing Sentiment Analysis in R

To demonstrate how sentiment analysis works, we’ll use the SentimentAnalysis package in R. This implementation utilizes various existing dictionaries, such as Harvard IV, QDAP, Loughran-McDonald, and DictionaryHE, which is a “dictionary with opinionated words from Henry’s Financial dictionary.” In addition, you can create customized dictionaries. In our example, we’ll use the acq data set from the tm package. This data set holds 50 news articles from the Reuters-21578 data set. All of the documents deal with corporate acquisitions.
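As a rough illustration of a customized dictionary, the sketch below builds a small binary dictionary with SentimentDictionaryBinary() and passes it to analyzeSentiment() through the rules argument. The word lists, the example sentences, and the column name "SentimentCustom" are assumptions made for this example. Stemming is switched off so that the unstemmed dictionary words line up with the document tokens.

```r
library(SentimentAnalysis)

# Hypothetical word lists chosen only for illustration.
custom_dict <- SentimentDictionaryBinary(
  c("great", "success", "gain"),    # positive words
  c("lousy", "failure", "loss")     # negative words
)

# Made-up example documents.
docs <- c("The merger was a great success.",
          "The quarter ended with a lousy loss.")

# Score the documents using only the custom dictionary.
# stemming = FALSE keeps the word lists above comparable to the tokens.
custom_sentiment <- analyzeSentiment(
  docs,
  rules = list("SentimentCustom" = list(ruleSentiment, custom_dict)),
  stemming = FALSE
)

custom_sentiment$SentimentCustom
```

With these word lists, the first document should score above zero and the second below zero. If you leave stemming at its default, the dictionary entries would need to be given in stemmed form as well.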

The R code below uses the analyzeSentiment() function to compute sentiment statistics for each article, then shows how many articles exhibit positive or negative sentiment. We’ll also show which articles have the highest and lowest sentiment scores. We’ll wrap up with a couple of data visualizations to help better understand the results of the analysis.


> # Sentiment analysis demo

> library(tm)
> library(SentimentAnalysis)

> # Simple example using a sentence. Note use of function 
> # convertToBinaryResponse() to convert a vector of 
> # continuous sentiment scores into a factor object.
> sentiment <- analyzeSentiment("My visit to Starbucks today was really lousy.")
> convertToBinaryResponse(sentiment)$SentimentQDAP
[1] negative
Levels: negative positive

> # More extensive example using the acq data set from tm
> # package, a corpus of 50 Reuters news articles dealing 
> # with corporate acquisitions.
> data(acq)

> # Analyze sentiment, pass corpus
> # The names of the columns are: "WordCount", "SentimentGI",
> # "NegativityGI", "PositivityGI", "SentimentHE",
> # "NegativityHE", "PositivityHE", "SentimentLM",
> # "NegativityLM", "PositivityLM", "RatioUncertaintyLM",
> # "SentimentQDAP", "NegativityQDAP", "PositivityQDAP"
> # Produces a 50x14 data frame
> sentiment <- analyzeSentiment(acq)

> # Each column is a numeric vector containing sentiment
> # statistics for each article
> class(sentiment$NegativityLM)
[1] "numeric"

> # Count positive and negative categories for the 
> # 50 news releases. 
> table(convertToBinaryResponse(sentiment$SentimentLM))
negative positive 
      26       24 

> # News releases with highest and lowest sentiment
> # Show highest 
> acq[[which.max(sentiment$SentimentLM)]]$meta$heading

> # Show lowest 
> acq[[which.min(sentiment$SentimentLM)]]$meta$heading

> # View summary statistics of sentiment variable
> summary(sentiment$SentimentLM)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.13043 -0.03237 -0.01127 -0.01653  0.00000  0.04545 

> # Visualize density of standardized sentiment variable values
> hist(sentiment$SentimentLM, probability=TRUE,
     main="Histogram: Density of Distribution for Standardized Sentiment Variable")
> lines(density(sentiment$SentimentLM))

> # Calculate the cross-correlation 
> cor(sentiment[, c("SentimentLM", "SentimentHE", "SentimentQDAP")])
              SentimentLM SentimentHE SentimentQDAP
SentimentLM    1.00000000 -0.01850194    0.37339476
SentimentHE   -0.01850194  1.00000000   -0.07240745
SentimentQDAP  0.37339476 -0.07240745    1.00000000

> # Draw a simple line plot to visualize the evolution of 
> # sentiment scores. Helpful when studying a time series 
> # of sentiment scores.
> plotSentiment(sentiment$SentimentLM, xlab="Reuters News Articles")


The area of sentiment analysis has gained a lot of traction in the past few years. The amount of unstructured text data is increasing at a fast clip. Text analytics, and sentiment analysis in particular, can shorten the path to insights.

In R, there are many good packages to use. We used SentimentAnalysis, but also could have used tidytext, another general text mining toolbox with sentiment analysis functionality.
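For comparison, here is a minimal sketch of what a tidytext-based approach might look like. It assumes the dplyr and tidytext packages are installed and uses the "bing" lexicon that tidytext provides through get_sentiments(). The two example documents are made up.

```r
library(dplyr)
library(tidytext)

# Two made-up example documents.
docs <- tibble(
  id   = 1:2,
  text = c("The merger was a great success.",
           "My visit today was really lousy.")
)

sentiment_counts <- docs %>%
  unnest_tokens(word, text) %>%                        # one lowercase token per row
  inner_join(get_sentiments("bing"), by = "word") %>%  # keep words found in the lexicon
  count(id, sentiment)                                 # tally positive/negative words per document

sentiment_counts
```

Unlike analyzeSentiment(), this approach leaves each step explicit, so you decide how to tokenize, which lexicon to join against, and how to aggregate the matches into a score.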

You may also wish to check out a new sentiment analysis research paper published by SentimentAnalysis package author Stefan Feuerriegel: “Sentiment analysis based on rhetorical structure theory: Learning deep neural networks from discourse trees.”


Daniel Gutierrez, ODSC

Daniel D. Gutierrez is a practicing data scientist who has been working with data since long before the field came into vogue. As a technology journalist, he enjoys keeping a pulse on this fast-paced industry. Daniel is also an educator, having taught data science, machine learning, and R classes at the university level. He has authored four computer industry books on database and data science technology, including his most recent title, “Machine Learning and Data Science: An Introduction to Statistical Learning Methods with R.” Daniel holds a BS in Mathematics and Computer Science from UCLA.