A research team hailing from Rensselaer Polytechnic Institute and Raytheon BBN Technologies used features from article content to create a predictive model for gauging a given community’s interest in news articles.
For the most part, past research in this area has tried to forecast general popularity rather than niche interest. Community preferences have largely been overlooked, with the exception of a single content-based model that attained 77% accuracy after being trained on a Reddit dataset. Since people with similar interests tend to flock to the same communities, it is reasonable to expect that community-specific prediction would be a valuable tool. With this in mind, the researchers sought to improve upon the results of the content-based model.
To develop new machine learning models, the researchers aggregated data from several news communities on Reddit, focusing on four subreddits characterized as “a general news community, a conspiracy news community, and two hyper-partisan news communities.” In total, more than 60,000 articles were analyzed, and seven groups of features were distinguished, ranging from style to sentiment. Using linear-kernel support vector machines and random forest classifiers, the researchers identified the features that contributed most to prediction. Classifier performance was measured by the area under the receiver operating characteristic curve (AUC). The preferences of the mainstream community were the easiest to separate from the others, with a near-perfect AUC close to 1.0; the best scores for separating the conspiracy community from each hyper-partisan community were 0.81 AUC and 0.84 AUC, respectively.
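For readers unfamiliar with the metric, AUC can be read as the probability that a classifier scores a randomly chosen positive example above a randomly chosen negative one, so 1.0 means perfect separation and 0.5 means chance. The sketch below computes it directly from that definition. It is only an illustration: the function name `roc_auc`, the labels, and the scores are hypothetical and are not taken from the researchers' code or data.

```python
def roc_auc(labels, scores):
    """Return AUC as the fraction of (positive, negative) pairs in which
    the positive example receives the higher score. Ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical data: 1 = article posted in community A, 0 = community B.
labels = [1, 1, 1, 0, 0, 0]

# Scores that rank every positive above every negative.
perfect = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]

# Scores where one positive (0.4) falls below one negative (0.5).
partial = [0.9, 0.4, 0.7, 0.5, 0.2, 0.1]

print(roc_auc(labels, perfect))
print(roc_auc(labels, partial))
```

The pairwise loop is quadratic in the number of examples, which is fine for illustration; on a dataset the size of the one described here, a rank-based implementation from an established library would be the more practical choice.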
Looking ahead, the researchers intend to use hierarchical-binary models for community-interest prediction in future work. Unlike standard multi-class models, hierarchical-binary models can help safeguard against underfitting. Further research may also look more closely at performance loss over time to gain insight into how frequently machine learning models need to be retrained.