The practical use of Natural Language Processing (NLP) models in applications has grown rapidly in recent years. Yet that growth has not resolved errors related to both data and output. Given this barrier, what can teams do to combat the problem and ensure that their NLP models are operating correctly?
Graham Neubig, PhD, professor at the Language Technologies Institute of Carnegie Mellon University, delivered an insightful keynote at ODSC West 2022. He opened with the latest advancements in NLP technology, from achieving human parity in automatic Chinese-to-English news translation to the use of GPT-3 to compete with Google's own search engine.
This points to a new paradigm in NLP – solving NLP tasks through text generation using language models and prompting. But though these models can confidently answer a question, is the answer correct? In one example, Graham asked a model a simple question: "What are the largest states in the US by population and surface area?" The model replied "Alaska and Texas," which is wrong, since California is the largest state by population.
In another example, he asked about the net worth of the current CEO of Twitter and was again given a wrong answer. The model named Jack Dorsey, who hasn't been the company's CEO since late 2021, and claimed his net worth was $2.5 billion. Both answers were factually incorrect, showing that these models have a problem providing factual information. Graham Neubig went on to demonstrate further issues with generated text around coherence and plausibility.
He touched on how evaluating text generated by a model can be about as hard as building the model itself. In short, there's a trust disconnect due to uncertainty about what's being generated. This creates a bottleneck in the NLP development pipeline at the evaluation stage, when you compare your model's outputs against test data in pursuit of a target level of accuracy.
So he asked a simple question: how does a team know if an NLP model is doing well? According to Graham, the "gold standard" is manual evaluation. A team takes the source and tests it against a few hypotheses, and an annotator reviews each hypothesis and scores it. The scoring criteria generally depend on the task the NLP model is attempting to solve.
But this solution has issues. The first is resources: having an annotator go through text manually is time-consuming, which in turn makes it expensive. Second, if the human reviewing the data isn't properly trained or motivated, the results of the manual evaluation can easily be counter-productive.
A human annotator isn't the only option, though. According to Graham, there is an alternative: an automatic evaluation process. The system takes the source material, all of the hypotheses, and a human-generated reference that shows what a correctly generated output should look like. From there, the process mirrors human evaluation, with a scoring system based on the task at hand.
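As a toy illustration of reference-based automatic scoring (a simplified sketch, not a specific metric from the keynote; the function name and example strings are my own), a hypothesis can be scored by how many of its words appear in the human-written reference:

```python
# Minimal sketch of reference-based automatic evaluation: score each
# hypothesis by unigram precision against a human-written reference.
# (Illustrative only -- production systems use metrics like BLEU or chrF.)

def unigram_precision(hypothesis: str, reference: str) -> float:
    """Fraction of hypothesis tokens that also appear in the reference."""
    hyp_tokens = hypothesis.lower().split()
    ref_tokens = set(reference.lower().split())
    if not hyp_tokens:
        return 0.0
    return sum(tok in ref_tokens for tok in hyp_tokens) / len(hyp_tokens)

reference = "california is the largest state by population"
hypotheses = [
    "california is the largest state by population",  # correct answer
    "alaska is the largest state by population",      # plausible but wrong
]
for hyp in hypotheses:
    print(round(unigram_precision(hyp, reference), 3))  # 1.0, then 0.857
```

The obvious limitation is that surface-overlap metrics like this one penalize valid paraphrases, which is part of what motivates embedding-based alternatives.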
Next, Graham Neubig touched on the push to evaluate NLP models by using other NLP models. In one example, he takes a reference and a candidate output and runs them through BERTScore, an embedding-based metric that scores the candidate by matching each of its tokens to the most similar token in the reference. Another example of this technique uses COMET. Here a human evaluator scores the source while the program scores the hypothesis; calculating the difference between the two yields a loss function. The next step is updating both the model and COMET to better refine the model's output, increasing accuracy.
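The embedding-matching idea behind BERTScore can be sketched in a few lines. This is a deliberately simplified version: real BERTScore uses contextual BERT embeddings and combines precision and recall into an F1 score, whereas here the vectors are invented and only the precision side is shown:

```python
import math

# Toy sketch of BERTScore-style greedy matching. Each candidate token
# embedding is matched to its most similar reference token embedding by
# cosine similarity, and the best-match scores are averaged.
# (The 2-d vectors below are made-up stand-ins for BERT embeddings.)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def greedy_match_score(cand_embs, ref_embs):
    """Average, over candidate tokens, of the best cosine match in the reference."""
    return sum(max(cosine(c, r) for r in ref_embs) for c in cand_embs) / len(cand_embs)

cand = [[1.0, 0.0], [0.0, 1.0]]       # two candidate token embeddings
ref_same = [[1.0, 0.0], [0.0, 1.0]]   # reference covering both tokens
ref_partial = [[1.0, 0.0]]            # reference covering only the first

print(greedy_match_score(cand, ref_same))     # 1.0
print(greedy_match_score(cand, ref_partial))  # 0.5
```

Because matching happens in embedding space rather than on surface strings, a paraphrase with similar token embeddings can still score well, which is the advantage over pure word-overlap metrics.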
Throughout the rest of the keynote, Graham Neubig answered questions from the audience. Though the keynote was virtual, ODSC West provided a world-class virtual experience that bridged speakers and their audience. Overall, NLP models have come far, yet many issues can still hamper a project. Evaluating, training, and testing have come a long way, but there is still plenty to work through.
If you found this keynote interesting, then you shouldn’t miss the next ODSC conference, ODSC East! Tickets are now 75% off for a limited time, so don’t delay!