At ODSC London 2018, Dr. Michael Swarbrick Jones gave a technical lecture on the latest in large-scale multilabel classification. He covered how practitioners should manage their hyperparameters and data to get the most out of their models, focusing specifically on problems with thousands or tens of thousands of labels.
Dr. Jones started by introducing a dataset he had manually created using the popular link-sharing site Reddit. Jones used all subreddits (topical communities on the site) with more than a thousand posts and manually classified them using more than a thousand different labels. This dataset is now available on Kaggle under the Creative Commons Public Domain license and can be used free of charge for future research.
Dr. Jones then began to discuss different approaches to text classification problems. Here are some of the top tips that came out of his talk:
Keeping It Simple
There’s no shame in going back to basics. Referencing Google’s research on text classification tasks, Jones recommended a bag-of-words approach when the ratio of the number of documents to the number of words per document is less than 1,500. So if you’re trying your hand at text classification and your dataset falls under that threshold, give Naive Bayes a shot before moving on to more sophisticated methods like neural networks.
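As a rough illustration, the ratio can be computed in a few lines. This is only a sketch: the function names are hypothetical, whitespace tokenization is a simplification, and using the median as the "words per document" figure is an assumption rather than something stated in the talk.

```python
from statistics import median


def documents_to_words_ratio(documents):
    """Number of documents divided by the median words per document.

    Assumption: 'words per document' is taken as the median word count,
    and words are split on whitespace.
    """
    words_per_doc = [len(doc.split()) for doc in documents if doc.split()]
    if not words_per_doc:
        raise ValueError("need at least one non-empty document")
    return len(words_per_doc) / median(words_per_doc)


def suggest_approach(documents, threshold=1500):
    """Return a coarse suggestion based on the 1,500 rule of thumb."""
    ratio = documents_to_words_ratio(documents)
    return "bag-of-words" if ratio < threshold else "something more sophisticated"


# Tiny toy corpus: 3 documents with 3, 2 and 4 words (median 3).
toy_docs = ["man bites dog", "dog barks", "man walks the dog"]
toy_ratio = documents_to_words_ratio(toy_docs)
toy_suggestion = suggest_approach(toy_docs)
```

On a real corpus the ratio would be computed over the training set only, and the result treated as a starting point rather than a verdict.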
How Big is a Gram?
Along with unigrams (one-word grams in the input vector), Jones suggested trying bigrams (two-word grams) when deciding on the input vector. For example, in the phrase “man bites dog,” the unigrams are ‘man’, ‘bites’ and ‘dog’, whereas the bigrams are ‘man bites’ and ‘bites dog’. The input vector is simply a count of how many times each gram it represents has appeared in a given document. In the bigram case, both the unigrams and bigrams are counted in the input vector.
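The "man bites dog" example can be reproduced with a minimal sketch using only the standard library. The helper names are hypothetical, and lowercasing plus whitespace splitting stands in for whatever tokenizer a real pipeline would use.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-word grams in order of appearance."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_features(text, max_n=2):
    """Count every gram of length 1..max_n in a document."""
    tokens = text.lower().split()  # simplification: whitespace tokenizer
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


features = count_features("man bites dog")
# features holds a count of 1 for each of:
# 'man', 'bites', 'dog', 'man bites', 'bites dog'
```

In practice the counts for each document are laid out against a fixed vocabulary so that every document yields a vector of the same length.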
In order to prevent data leakage, order your data by time before splitting. This is because using a post from the future to predict on posts from the past breaks assumptions about how information is siloed during training. For a deeper lecture on data leaks and how they can damage a model, check out Yuriy Guts’ talk from London 2018 as well!
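A time-ordered split might look like the sketch below. The `created_utc` field name and the 80/20 split are assumptions for illustration, not details from the talk.

```python
def chronological_split(posts, train_fraction=0.8, time_key="created_utc"):
    """Sort posts by timestamp, then cut once so that every training
    post is no newer than any test post."""
    ordered = sorted(posts, key=lambda post: post[time_key])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Toy posts with shuffled (hypothetical) timestamps.
posts = [
    {"title": "post %d" % ts, "created_utc": ts}
    for ts in [50, 10, 90, 30, 70, 20, 100, 60, 40, 80]
]
train, test = chronological_split(posts)

newest_train = max(post["created_utc"] for post in train)
oldest_test = min(post["created_utc"] for post in test)
```

The key design choice is sorting before cutting: a random shuffle would mix future posts into the training set.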
Smaller Taxonomies are Better
If you have the opportunity to build your own taxonomy scheme for classification, Jones recommends building it in a ‘breadth-first’ fashion. This is because intuition tells us that it can be very difficult for a classification engine to predict deep levels of subclasses, while breadth allows it to parse out a wider variety of categories.
If you choose to build your own taxonomy, carefully consider what level of granularity you really need in your classification scheme. Do you really need all 10,000 classes, or will 1,000 fulfill your use case?
Accuracy Isn’t Everything
Besides accuracy, ‘macro’ metrics are also useful for classification tasks at this level. A macro-averaged metric is computed for each label separately and then averaged, so every label counts equally no matter how often it appears. The quality of a model should be judged by how well it performs on rare labels as well as common ones. Since a model can boost its accuracy by only guessing common labels, accuracy should not be taken at face value when evaluating.
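To see why this matters, the sketch below compares micro- and macro-averaged F1 for a toy model that only ever guesses the common label. F1 is used here as one example of a metric that can be macro-averaged; the talk summary does not say which metric Jones used, and the function names are hypothetical.

```python
def per_label_counts(true_sets, pred_sets):
    """Tally (true positives, false positives, false negatives) per label."""
    counts = {}
    for true, pred in zip(true_sets, pred_sets):
        for label in true | pred:
            tp, fp, fn = counts.get(label, (0, 0, 0))
            if label in true and label in pred:
                tp += 1
            elif label in pred:
                fp += 1
            else:
                fn += 1
            counts[label] = (tp, fp, fn)
    return counts


def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(counts):
    """Pool all labels' counts together, then compute one F1."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn)


def macro_f1(counts):
    """Compute F1 per label, then take the unweighted mean."""
    return sum(f1(*c) for c in counts.values()) / len(counts)


# Nine documents carry the common label, one carries the rare label,
# and the model predicts 'common' for everything.
true_labels = [{"common"}] * 9 + [{"rare"}]
predicted = [{"common"}] * 10

counts = per_label_counts(true_labels, predicted)
micro = micro_f1(counts)  # 0.9: looks healthy
macro = macro_f1(counts)  # about 0.47: exposes the missed rare label
```

The pooled score hides the fact that the rare label is never predicted, while the macro average drops sharply because that label's F1 of zero counts as much as the common label's.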
Feature selection is incredibly important for building a model that is capable of deriving useful insights from the data, especially when dealing with complex models such as neural networks. This means considering rare labels as well as common ones, but it also entails limiting the number of features being used where appropriate.
Creating a model that works in the short term is only half the job: you must also consider whether it will continue to perform well in the future. Text corpora are liable to change over time, altering the word distribution and the context in which certain useful words appear. To an extent, this problem can be mitigated by ordering the data chronologically when splitting, though there is no guarantee of how long this will enable your model to predict accurately. All you can do is update the data and retrain.
Don’t Do It!
Above all, Dr. Jones recommended avoiding large-scale multilabel classification whenever possible. While there have been significant gains in recent years, the best models are still far from perfect. Additionally, the models tend to be very expensive to train and require massive datasets to achieve even middling performance.
These are just the highlights of Dr. Jones’ talk. You can check out the full lecture on YouTube, which includes more helpful nuggets for practitioners and researchers attempting to tackle large-scale multilabel classification.