Sentiment analysis lies at the heart of natural language processing, text mining/analytics, and computational linguistics. It refers to any measurement technique that extracts subjective information from textual documents. In other words, it estimates the polarity of the expressed sentiment on a scale running from positive to negative.
Performing sentiment analysis involves converting the text into a machine-readable format. This is done through a number of preprocessing steps: you tokenize the text into single words, remove stop words and punctuation, stem the words, and convert everything to lowercase. The R package we’ll use in this article performs these operations automatically.
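The preprocessing steps above can be sketched in a few lines of base R. This is only an illustration, not the package's pipeline: `stop_words` is a tiny hand-picked list, and `naive_stem()` and `preprocess()` are hypothetical helpers that stand in for the much more thorough stop-word lists and stemmers that text mining packages provide.

```r
# Minimal preprocessing sketch in base R (illustrative only).
# `stop_words` is a tiny hand-picked list and `naive_stem()` is a crude
# suffix stripper; both are hypothetical stand-ins for real tooling.
stop_words <- c("a", "an", "the", "to", "was", "my", "is", "of", "and")

naive_stem <- function(words) {
  # Strip a few common English suffixes; a real stemmer does far more.
  sub("(ing|ed|ly|s)$", "", words)
}

preprocess <- function(text) {
  text   <- tolower(text)                               # convert to lowercase
  text   <- gsub("[[:punct:]]+", " ", text)             # remove punctuation
  tokens <- strsplit(trimws(text), "[[:space:]]+")[[1]] # tokenize on whitespace
  tokens <- tokens[!tokens %in% stop_words]             # remove stop words
  naive_stem(tokens)                                    # stem each remaining token
}

tokens <- preprocess("My visit to Starbucks today was really lousy.")
print(tokens)
```

For this sentence the sketch keeps five tokens: "visit", "starbuck", "today", "real", and "lousy". The crude stemmer turns "really" into "real", which shows why production code relies on a proper stemming algorithm.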
Doing Sentiment Analysis in R
To demonstrate how sentiment analysis works, we’ll use the SentimentAnalysis package in R. This implementation utilizes various existing dictionaries, such as Harvard IV, QDAP, Loughran-McDonald, and DictionaryHE, which is a “dictionary with opinionated words from Henry’s Financial dictionary.” In addition, you can create customized dictionaries. In our example, we’ll use the acq data set from the tm package. This data set holds 50 news articles from the Reuters-21578 data set, all of which deal with corporate acquisitions.
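To make the idea of a dictionary concrete, here is a minimal base R sketch of how a word list can turn a document into a polarity score. Everything in it is an assumption for illustration: the word lists are invented, `tokenize()` and `score_sentiment()` are hypothetical helpers, and the scoring rule (positive minus negative word count, divided by total words) is a common convention that may differ from what the package computes internally.

```r
# Sketch of dictionary-based polarity scoring in base R.
# The word lists are hypothetical, not taken from any published dictionary.
positive_words <- c("good", "great", "gain", "strong")
negative_words <- c("lousy", "bad", "loss", "weak")

tokenize <- function(text) {
  # Lowercase, replace punctuation with spaces, split on whitespace.
  text <- gsub("[[:punct:]]+", " ", tolower(text))
  strsplit(trimws(text), "[[:space:]]+")[[1]]
}

score_sentiment <- function(tokens) {
  # Assumed rule: (positive count - negative count) / total word count.
  if (length(tokens) == 0) return(NA_real_)
  n_pos <- sum(tokens %in% positive_words)
  n_neg <- sum(tokens %in% negative_words)
  (n_pos - n_neg) / length(tokens)
}

docs <- c("Strong quarter with a great gain",
          "My visit today was really lousy")

scores <- vapply(docs, function(d) score_sentiment(tokenize(d)),
                 numeric(1), USE.NAMES = FALSE)

# Treating a score of zero as positive is a choice made for this sketch.
labels <- ifelse(scores >= 0, "positive", "negative")
print(data.frame(score = scores, label = labels))
```

Scores above zero indicate more positive than negative dictionary hits and scores below zero the opposite, which matches the positive-to-negative scale described earlier. Swapping in a different word list changes the result without touching the scoring code, which is what makes customized dictionaries useful for domain-specific text.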
The R code below uses the analyzeSentiment() function to compute sentiment statistics for each article, then shows how many exhibit positive/negative sentiment. We’ll also show which article has the highest and lowest sentiment score. We’ll wrap up with a couple of data visualizations to help better understand the results of the analysis.
> # Sentiment analysis demo
> library(tm)
> library(SentimentAnalysis)
>
> # Simple example using a sentence. Note use of function
> # convertToBinaryResponse() to convert a vector of
> # continuous sentiment scores into a factor object.
> sentiment <- analyzeSentiment("My visit to Starbucks today was really lousy.")
> convertToBinaryResponse(sentiment)$SentimentQDAP
[1] negative
Levels: negative positive
>
> # More extensive example using the acq data set from tm
> # package, a corpus of 50 Reuters news articles dealing
> # with corporate acquisitions.
> data(acq)
>
> # Analyze sentiment, pass corpus
> # The names of the columns are: "WordCount", "SentimentGI",
> # "NegativityGI", "PositivityGI", "SentimentHE",
> # "NegativityHE", "PositivityHE", "SentimentLM",
> # "NegativityLM", "PositivityLM", "RatioUncertaintyLM",
> # "SentimentQDAP", "NegativityQDAP", "PositivityQDAP"
> #
> # Produces data frame 50x14
> sentiment <- analyzeSentiment(acq)
>
> # Numeric vector containing sentiment statistics for each
> # article
> class(sentiment$NegativityLM)
[1] "numeric"
>
> # Count positive and negative categories for the
> # 50 news releases.
> table(convertToBinaryResponse(sentiment$SentimentLM))

negative positive 
      26       24 
>
> # News releases with highest and lowest sentiment
> # Show highest
> acq[[which.max(sentiment$SentimentLM)]]$meta$heading
[1] "VERSATILE TO SELL UNIT TO VICON"
>
> # Show lowest
> acq[[which.min(sentiment$SentimentLM)]]$meta$heading
[1] "GULF APPLIED TECHNOLOGIES <GATS> SELLS UNITS"
>
> # View summary statistics of sentiment variable
> summary(sentiment$SentimentLM)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.13043 -0.03237 -0.01127 -0.01653  0.00000  0.04545 
>
> # Visualize density of standardized sentiment variable values
> hist(sentiment$SentimentLM, probability=TRUE, main="Histogram: Density of Distribution for Standardized Sentiment Variable")
> lines(density(sentiment$SentimentLM))
>
> # Calculate the cross-correlation
> cor(sentiment[, c("SentimentLM", "SentimentHE", "SentimentQDAP")])
              SentimentLM SentimentHE SentimentQDAP
SentimentLM    1.00000000 -0.01850194    0.37339476
SentimentHE   -0.01850194  1.00000000   -0.07240745
SentimentQDAP  0.37339476 -0.07240745    1.00000000
>
> # Draw a simple line plot to visualize the evolvement of
> # sentiment scores. Helpful when studying a time series
> # of sentiment scores.
> plotSentiment(sentiment$SentimentLM, xlab="Reuters News Articles")
Sentiment analysis has gained a lot of traction in the past few years. The amount of unstructured text data is growing quickly, and text analytics, sentiment analysis in particular, is helping to shorten the path to insights.
In R, there are many good packages to use. We used SentimentAnalysis, but could also have used tidytext, another general text mining toolbox with sentiment analysis functionality.
You may also wish to check out a new sentiment analysis research paper published by SentimentAnalysis package author Stefan Feuerriegel: “Sentiment analysis based on rhetorical structure theory: Learning deep neural networks from discourse trees.”