Matt will be presenting more on ULMFiT at ODSC East 2019 this May! Check out his talk “State of the Art Text Classification with ULMFiT” there.
The rise of the internet has led to a faster flow of information, where news posted to a relatively obscure blog can be shared on social media and reach national publications within hours. The volume of information is such that humans alone cannot filter out noise, identify important new viewpoints, and determine how messaging trends are changing over time. At Novetta, we are constantly evaluating advances in deep learning to help our customers address these challenges.
Deep Learning for Text Classification
Recent advances in deep learning have significantly improved performance on natural language processing (NLP) tasks such as text classification. One of the most promising advances is Universal Language Model Fine-tuning for Text Classification (ULMFiT), created by Jeremy Howard and Sebastian Ruder. In their paper, they demonstrated that applying transfer learning to NLP reduced error by 18-24% on the majority of the standard text classification datasets they tested.
Much like transfer learning for vision tasks, the power of ULMFiT comes from starting with a pre-trained model – in this case, a language model trained on WikiText-103. This pre-trained language model has learned to predict the next word in a sequence. Since language will be used differently in the target corpus, the pre-trained model is first fine-tuned on the target corpus, and only then is the topic classifier trained on top of it.
One of my company’s products, Novetta Mission Analytics (NMA), is used to analyze trends in media over time. A core component of that analysis is the tagging of quotes from news articles by topic and sub-topic. This tagging is traditionally done by trained analysts, as the quality of the tags is of paramount importance to our customers. My machine learning team set out to evaluate how ULMFiT could be used to complement the analyst-based tagging process.
Though Howard and Ruder demonstrated the power of ULMFiT on a range of text datasets, we approached our experiments with some skepticism. We expected the task to be difficult because NMA topics are customer-specific, with as many as 150 sub-topics for a given customer. That is a harder problem than the tasks typically used to evaluate text classifiers.
From the perspective of a data scientist, NMA’s data is a gold mine – hundreds of thousands of hand-labeled quotes carefully collected over the last decade. In coordination with the NMA team, we selected training data and started to evaluate ULMFiT. We implemented ULMFiT using fastai, a deep learning library built on top of PyTorch. Using an example from the fastai repo on GitHub as our starting point, we set up a pipeline to fine-tune the language model on our quotes and then train a classifier.
Our initial results were surprisingly good – 80-90% of the time the correct label appeared in the top 3 model predictions. Because the numbers were better than we expected, we took a deeper dive to see whether anything could have artificially inflated them, such as information leaking between our training and validation sets. After some additional data exploration, we were satisfied that the results were genuine and that the model’s accuracy was close, in fact, to that of our trained analysts.
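To make the two checks above concrete, here is a minimal sketch of how a top-3 hit rate and a simple train/validation overlap check could be computed. It is not our production code and does not use fastai. The function names (`top_k_accuracy`, `leaked_quotes`), the topic labels, and the toy quotes are all hypothetical, and the leakage check assumes that "leakage" means the same quote text appearing in both splits after basic whitespace and case normalization.

```python
from typing import Dict, Sequence, Set


def top_k_accuracy(
    scores: Sequence[Dict[str, float]], labels: Sequence[str], k: int = 3
) -> float:
    """Fraction of examples whose true label is among the k highest-scoring labels."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        return 0.0
    hits = 0
    for example_scores, true_label in zip(scores, labels):
        # Rank label names by predicted score, highest first, and keep the top k.
        top_k = sorted(example_scores, key=example_scores.get, reverse=True)[:k]
        hits += true_label in top_k
    return hits / len(labels)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def leaked_quotes(train: Sequence[str], valid: Sequence[str]) -> Set[str]:
    """Normalized quotes that appear in both the training and validation sets."""
    train_set = {normalize(t) for t in train}
    valid_set = {normalize(v) for v in valid}
    return train_set & valid_set


# Toy data with made-up topic labels and quotes, for illustration only.
predictions = [
    {"economy": 0.60, "health": 0.30, "energy": 0.08, "defense": 0.02},
    {"economy": 0.10, "health": 0.20, "energy": 0.30, "defense": 0.40},
]
true_labels = ["energy", "economy"]

print(top_k_accuracy(predictions, true_labels, k=3))
print(leaked_quotes(["The  budget passed."], ["the budget passed.", "Rates rose."]))
```

An exact-match check like this only catches verbatim duplicates. Near-duplicates, such as the same quote reprinted with small edits by different outlets, would need fuzzier matching.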
ULMFiT in Production
Since those initial experiments, we have started to evaluate ULMFiT-based models in our production NMA system to enhance the efficiency and quality of our tagging process.
We have also developed a custom pipeline through which we can modify the model for other datasets in as little as a day. This has enabled us to employ ULMFiT against a range of other use cases, such as classifying companies by industry type based solely on a few-sentence description of their activities.
We believe we have only scratched the surface of how automated text classification can help our customers make sense of large amounts of unstructured text. If you want to learn more about how we are deploying deep learning-based text classification at Novetta, come see my presentation at ODSC East 2019.