The American political season often conjures numerous references to the country’s origins from both sides of the aisle. What better way to join in than by looking at the country’s birth using Data Science, the field that will dictate much of its future? I’ll do this by leveraging a subset of Natural Language Processing (NLP) known as Sentiment Analysis. Its use cases are many and varied, especially for businesses seeking to gauge customers’ feelings about products and services. Here, though, it will be used on a much smaller scale with only one document: The Declaration of Independence.
The first step is to get some training data. I used this movie review dataset from Stanford University. There are 50,000 movie reviews split equally across training and testing sets and across positive and negative instances. One problem is immediately evident: even after combining the training and testing sets, it’s unlikely that the content will be close to that of the Declaration of Independence. This matters a lot for getting (seemingly) accurate results. Some more nuance in the labels would be nice to have as well. Still, let’s forge ahead.
In order to build a model, the text data has to be preprocessed. This process can get complicated when true artists are at work. It will be much more straightforward here. Punctuation marks will be removed, the phrases will be converted to lowercase, and lemmatization will be applied. In NLP, lemmatization and stemming refer to similar processes, but the former is more robust.
Lemmatization maps the inflected forms of a word to a common base form, which cuts down on redundancy. For example, ‘am’, ‘are’, and ‘is’ would all be converted to ‘be’. A stemmer, on the other hand, strips suffixes without checking that what remains is a real word, so it would reduce ‘having’ to just ‘hav’.
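The three preprocessing steps can be sketched in plain Python. This is a toy illustration only: `TOY_LEMMAS`, `toy_stem`, and `preprocess` are hypothetical names, the lookup table stands in for the dictionary a real lemmatizer would consult, and the suffix rule is a crude stand-in for a real stemmer.

```python
import string

# Hypothetical lookup table standing in for a real lemmatizer's dictionary.
TOY_LEMMAS = {
    "am": "be", "are": "be", "is": "be",
    "having": "have", "has": "have",
}


def toy_stem(word):
    """Crude stemming: chop a trailing 'ing' without checking the result."""
    # The length guard leaves short words such as 'king' alone.
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    return word


def preprocess(text):
    """Strip ASCII punctuation, lowercase, then map tokens to toy lemmas."""
    # string.punctuation covers ASCII only; curly quotes would survive.
    table = str.maketrans("", "", string.punctuation)
    tokens = text.translate(table).lower().split()
    return [TOY_LEMMAS.get(token, token) for token in tokens]


print(preprocess("He is having none of it, and we ARE free!"))
# ['he', 'be', 'have', 'none', 'of', 'it', 'and', 'we', 'be', 'free']
print(toy_stem("having"))
# hav
```

The lemma lookup always returns a real word, while the suffix rule happily produces ‘hav’, which is the robustness gap described above.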
Once the text is vectorized, the result is a matrix with over 1.6 million columns, far more than the number of data points. This greatly increases the computational cost of model building, so feature selection is advised.
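The vectorizer settings behind that column count are not stated here. One common way a vocabulary grows this large is by counting word pairs as well as single words, so the sketch below assumes a bag of n-grams. The helper names (`ngrams`, `build_vocabulary`, `vectorize`) and the two tiny documents are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n_max=2):
    """Yield every n-gram of length 1..n_max as a space-joined string."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def build_vocabulary(docs, n_max=2):
    """Map each distinct n-gram in the corpus to a column index."""
    vocab = {}
    for tokens in docs:
        for gram in ngrams(tokens, n_max):
            vocab.setdefault(gram, len(vocab))
    return vocab


def vectorize(tokens, vocab, n_max=2):
    """Sparse row as {column index: count}; unseen n-grams are ignored."""
    counts = Counter(g for g in ngrams(tokens, n_max) if g in vocab)
    return {vocab[g]: c for g, c in counts.items()}


# Made-up toy corpus, already tokenized.
docs = [["a", "great", "movie"], ["a", "dull", "movie"]]
unigram_vocab = build_vocabulary(docs, n_max=1)
bigram_vocab = build_vocabulary(docs, n_max=2)
print(len(unigram_vocab), len(bigram_vocab))
# 4 8
```

Even on two three-word documents, adding bigrams doubles the number of columns. On tens of thousands of reviews the same effect compounds quickly.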
The idea would be to try different values for the number of features to keep, or to keep all of those that are statistically significant. (SelectKBest in scikit-learn exposes p-values for the features through its pvalues_ attribute when the scoring function provides them.) However, for now I’m going to choose the best 100.
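To make "keep the best k features" concrete, here is a hand-rolled stand-in written in plain Python. It is not scikit-learn's implementation, and the scoring function used for this post is not stated, so the chi-square statistic on a 2x2 term-presence table is an assumption. The function names and the four toy documents are hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for the table [[a, b], [c, d]], no correction."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_k_best(docs, labels, k):
    """Rank terms by association between term presence and a 0/1 label."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_df, neg_df = {}, {}  # document frequency of each term per class
    for tokens, label in zip(docs, labels):
        target = pos_df if label == 1 else neg_df
        for term in set(tokens):
            target[term] = target.get(term, 0) + 1
    scores = {}
    for term in set(pos_df) | set(neg_df):
        a = pos_df.get(term, 0)  # positive docs containing the term
        b = neg_df.get(term, 0)  # negative docs containing the term
        scores[term] = chi_square_2x2(a, b, n_pos - a, n_neg - b)
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k], scores


# Made-up toy data: 1 = positive review, 0 = negative review.
docs = [["great", "movie"], ["great", "fun"], ["dull", "movie"], ["dull", "plot"]]
labels = [1, 1, 0, 0]
top_terms, scores = select_k_best(docs, labels, k=2)
print(top_terms)
# ['dull', 'great']
```

A term such as ‘movie’, which appears equally often in both classes, scores zero and is dropped first. Trying several values of `k`, as suggested above, amounts to cutting the ranked list at different points.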
There is no worse result than seeing models beaten by random guessing. In this first pass I will proceed with the Random Forest Classifier to try some hyperparameter tuning.
Even a grid search produces poor results that are negligibly better than random guessing. Refitting the best estimator on all the training data adds only an iota of predictive power. The confusion matrix shows that while the model does very well on negative training examples, it is surprisingly bad on positive ones.
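A lopsided confusion matrix like this can hide behind a near-chance accuracy figure, which is why per-class recall is worth checking. The sketch below uses made-up labels chosen only to mimic that pattern; they are not the results from this model, and the function names are hypothetical.

```python
def confusion_matrix(y_true, y_pred):
    """2x2 counts: rows are the true class, columns the predicted class."""
    matrix = [[0, 0], [0, 0]]
    for truth, pred in zip(y_true, y_pred):
        matrix[truth][pred] += 1
    return matrix


def per_class_recall(matrix):
    """Share of each true class that the model labeled correctly."""
    recalls = []
    for cls, row in enumerate(matrix):
        total = sum(row)
        recalls.append(row[cls] / total if total else 0.0)
    return recalls


def accuracy(matrix):
    """Correct predictions (the diagonal) over all predictions."""
    total = sum(sum(row) for row in matrix)
    return (matrix[0][0] + matrix[1][1]) / total if total else 0.0


# Made-up labels for illustration: 0 = negative, 1 = positive.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1]
matrix = confusion_matrix(y_true, y_pred)
print(matrix)
# [[4, 0], [3, 1]]
print(per_class_recall(matrix))
# [1.0, 0.25]
print(accuracy(matrix))
# 0.625
```

In this toy case every negative example is caught while three of four positives are missed, so a model that leans heavily toward "negative" still lands only a little above 50% accuracy on balanced data.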
There is a lot of work to be done here in terms of iterations to improve the model, but let’s move on for now. I want to compare these results to those provided by some APIs. Algorithmia’s sentiment analysis API has a range of labels from 0 for very negative to 4 for very positive.
These look better, but there are some strange results. This, for example, is definitely not a negative statement:
And for the support of this declaration, with a firm reliance on the protection of Divine Providence, we mutually pledge to each other our lives, our fortunes and our sacred honor.
Another API I want to try is MonkeyLearn. I discovered the platform after their post on the Twitter sentiment surrounding the Brexit vote.
MonkeyLearn only uses three labels – positive, neutral, and negative – but the results look much better.
In summary, the artisanal model predicts every sentence in the Declaration of Independence as negative. Algorithmia is almost as cynical, but does have a few neutral and positive labels thrown in. MonkeyLearn’s distribution of labels is much more evenly spread out.
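Comparing label distributions across tools requires putting them on one scale. The sketch below collapses a 0 to 4 score into three labels and computes the share of each label. The mapping is an assumption (the post does not define how the two scales line up), and the scores shown are hypothetical, not output from either API.

```python
from collections import Counter

# Assumed mapping from the five-point scale to three labels.
FIVE_TO_THREE = {
    0: "negative", 1: "negative",
    2: "neutral",
    3: "positive", 4: "positive",
}


def to_three_way(score):
    """Collapse a 0-4 score into negative / neutral / positive."""
    if score not in FIVE_TO_THREE:
        raise ValueError(f"expected a score from 0 to 4, got {score!r}")
    return FIVE_TO_THREE[score]


def label_distribution(labels):
    """Fraction of sentences carrying each label."""
    if not labels:
        return {}
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}


# Hypothetical per-sentence scores, for illustration only.
hypothetical_scores = [0, 1, 2, 4]
collapsed = [to_three_way(s) for s in hypothetical_scores]
print(label_distribution(collapsed))
# {'negative': 0.5, 'neutral': 0.25, 'positive': 0.25}
```

Where to draw the neutral band is a judgment call; treating 1 and 3 as neutral instead would make the five-point tool look more moderate.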
It’s no surprise that the pre-built APIs work much better than the artisanal model, which needs a lot more time in the oven. The importance of one’s training data cannot be overstated. To build a model that works well across a number of domains, a large and varied corpus is mandatory. Another takeaway is the importance of good feature engineering, especially in the face of computational limitations.
A future post will explore techniques to improve the artisanal model by overcoming the aforementioned limitations. For now you can compare the results of each model in this app.
©ODSC 2016, Feel free to share + backlink!