Natural Language Processing (NLP) has advanced significantly since 2018, when ULMFiT and Google’s release of the BERT language model approached human-level performance on a range of use cases. Since then, several models with similarly interesting names such as XLM, GPT-2, XLNet, and ALBERT have been released in quick succession, each improving on its predecessors. While these state-of-the-art models can perform language-based tasks at a near-human level on large volumes of unstructured text for certain use cases, knowing which model to use, when to use it, and how to use it can be a challenge.
At Novetta, we explored what it would take to streamline the implementation of state-of-the-art models for different NLP tasks, so that practitioners can put them to use quickly.
We have developed an open-source framework, AdaptNLP, that lowers the barrier to entry for practitioners to use these advanced capabilities. AdaptNLP is built atop two open-source libraries: Transformers (from Hugging Face) and Flair (from Zalando Research). AdaptNLP enables users to fine-tune language models for text classification, question answering, entity extraction, and part-of-speech tagging.
To show how AdaptNLP can be put to use, we will address a Question Answering (QA) task using BERT. This task automates the answering of questions, posed by humans, against a corpus of text.
Using AdaptNLP starts with a Python pip install.
pip install adaptnlp
First, we import EasyQuestionAnswering, which abstracts transformer-based Question Answering tasks to their most basic components.
from adaptnlp import EasyQuestionAnswering
We can now frame our question as a simple string. The context variable holds the source text that we want to search through for an answer. Because a question may have multiple valid answers, we specify how many results to return using top_n.
## Example Query and Context
query = "What does Novetta do?"
context = "Novetta pioneers disruptive technologies in data analytics, full-spectrum cyber, media analytics, and multi-INT fusion. Novetta brings actionable insights to your most complex data challenges. We enable customers to find clarity from the noisy complexity of big data at the speed and scale of the most intensive national security missions."
top_n = 5
We now call predict_qa(), which defaults to a pre-trained BERT-based QA model, passing it the question, the context, and the number of answers we would like to see. The model determines which parts of the context may be our answer. Each result contains the text the model believes to be the answer, a probability score, and the start and end indices that locate the answer in the original context.
## Load the QA module and run inference on results
qa = EasyQuestionAnswering()
best_answer, best_n_answers = qa.predict_qa(query=query, context=context, n_best_size=top_n)
We can now look at best_answer to see the most relevant result, or at best_n_answers to see the top answers, up to the number we previously specified.
## Output top answer as well as top 5 answers
print("Best Answer:\n", best_answer)
print("Best n Answers:\n", best_n_answers)

Best Answer:
'brings actionable insights to your most complex data challenges'
Best n Answers:
[OrderedDict([('text', 'brings actionable insights to your most complex data challenges'), ('probability', 0.5482685518182449), ('start_index', 15), ('end_index', 23)]),
 OrderedDict([('text', 'pioneers disruptive technologies'), ('probability', 0.11097169321630729), ('start_index', 1), ('end_index', 3)]),
 OrderedDict([('text', 'brings actionable insights to your most complex data challenges. We enable customers to find clarity from the noisy complexity of big data'), ('probability', 0.07600267159691482), ('start_index', 15), ('end_index', 36)]),
 OrderedDict([('text', ''), ('probability', 2.8758867595942603e-08), ('start_index', 0), ('end_index', 0)])]
Note: We have trimmed the example output to a few results for brevity and to demonstrate variety.
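The result structure shown in the example output lends itself to simple post-processing. The sketch below is an illustration, not part of AdaptNLP: it hard-codes a copy of the example output, drops empty or low-confidence answers, and recovers each answer from the context using its indices. It assumes, based only on the example output, that start_index and end_index are inclusive positions in the whitespace-split context. The helper names keep_confident and span_from_indices and the 0.05 threshold are hypothetical choices.

```python
from collections import OrderedDict

context = (
    "Novetta pioneers disruptive technologies in data analytics, "
    "full-spectrum cyber, media analytics, and multi-INT fusion. "
    "Novetta brings actionable insights to your most complex data challenges. "
    "We enable customers to find clarity from the noisy complexity of big data "
    "at the speed and scale of the most intensive national security missions."
)

# Hard-coded copy of the example output, so this runs without a model.
best_n_answers = [
    OrderedDict([("text", "brings actionable insights to your most complex data challenges"),
                 ("probability", 0.5482685518182449),
                 ("start_index", 15), ("end_index", 23)]),
    OrderedDict([("text", "pioneers disruptive technologies"),
                 ("probability", 0.11097169321630729),
                 ("start_index", 1), ("end_index", 3)]),
    OrderedDict([("text", "brings actionable insights to your most complex data challenges. "
                          "We enable customers to find clarity from the noisy complexity of big data"),
                 ("probability", 0.07600267159691482),
                 ("start_index", 15), ("end_index", 36)]),
    OrderedDict([("text", ""),
                 ("probability", 2.8758867595942603e-08),
                 ("start_index", 0), ("end_index", 0)]),
]


def keep_confident(answers, min_probability=0.05):
    """Drop empty answers and answers below the (hypothetical) threshold."""
    return [a for a in answers if a["text"] and a["probability"] >= min_probability]


def span_from_indices(text, start_index, end_index):
    """Rebuild a span, ASSUMING inclusive indices into the whitespace-split text."""
    tokens = text.split()
    return " ".join(tokens[start_index:end_index + 1])


confident = keep_confident(best_n_answers)
for answer in confident:
    span = span_from_indices(context, answer["start_index"], answer["end_index"])
    print(f'{answer["probability"]:.2f}  {answer["text"]}')
    # The recovered span can carry trailing punctuation that the answer text omits.
    print("      span from indices:", span)
```

Under that indexing assumption, the recovered spans line up with the answer text in this example, apart from trailing punctuation on the last token. Other models or tokenizers may index differently, so it is worth checking before relying on the indices.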
These outputs can easily be integrated into user-built systems because they provide text-based metadata such as the extracted answer text, start/end indices, and confidence scores. By standardizing the input and output data and the function calls, AdaptNLP lets developers use NLP algorithms the same way regardless of which model is used in the backend. Before AdaptNLP, we would integrate each newly released model and its pre-trained weights individually and then rebuild the NLP task pipeline around it. The rapid pace of advancements and releases of NLP models made this process time-consuming and repetitive. To overcome this, AdaptNLP provides a streamlined process that can leverage new models in existing workflows without having to overhaul code.
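One way to picture this standardization, as a sketch rather than AdaptNLP's own code: downstream code depends only on the call shape predict(query=..., context=..., n_best_size=...) and on the result keys shown in the example output, so the backend can change without touching it. The answers_as_json helper and the fake_predict_qa stub are hypothetical. The stub stands in for qa.predict_qa so the example runs without downloading a model.

```python
import json
from collections import OrderedDict


def answers_as_json(predict, query, context, top_n=5):
    """Call any QA backend that follows the predict_qa call shape
    and return its results as a JSON string of text-based metadata."""
    best_answer, best_n_answers = predict(
        query=query, context=context, n_best_size=top_n
    )
    record = {
        "query": query,
        "best_answer": best_answer,
        "answers": [
            {
                "text": a["text"],
                "probability": round(a["probability"], 4),
                "start_index": a["start_index"],
                "end_index": a["end_index"],
            }
            for a in best_n_answers
        ],
    }
    return json.dumps(record, indent=2)


def fake_predict_qa(query, context, n_best_size):
    """Hypothetical stand-in for qa.predict_qa; returns one canned answer."""
    answers = [
        OrderedDict([("text", "pioneers disruptive technologies"),
                     ("probability", 0.11097169321630729),
                     ("start_index", 1), ("end_index", 3)])
    ]
    return answers[0]["text"], answers[:n_best_size]


payload = answers_as_json(
    fake_predict_qa,
    query="What does Novetta do?",
    context="Novetta pioneers disruptive technologies in data analytics.",
)
print(payload)
```

With AdaptNLP installed, passing qa.predict_qa in place of the stub should behave the same way, provided its return shape matches the example output shown earlier.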
Using the latest transformer embeddings, AdaptNLP makes it easy to fine-tune and train state-of-the-art token classification (NER, POS, Chunk, Frame Tagging), sentiment classification, and question-answering models. We will be giving a hands-on workshop on using AdaptNLP with state-of-the-art models at ODSC East 2020 in Boston.
Andrew and Brian are speakers for ODSC East 2020 this April 13-17 in Boston. Be sure to check out their talk, “State-of-the-art NLP Made Easy,” at this upcoming event!
About the speakers/authors:
Brian Sacash is a Lead Machine Learning Engineer in Novetta’s Machine Learning Center of Excellence. He helps various organizations discover the best ways to extract value from data. His interests are in the areas of Natural Language Processing, Machine Learning, Big Data, and Statistical Methods. Brian holds a Master of Science in Quantitative Analysis from the University of Cincinnati and a Bachelor of Science in Physics from Ohio Northern University.
Andrew Chang is a Senior Machine Learning Engineer in Novetta’s Machine Learning Center of Excellence. Andrew is a graduate of Carnegie Mellon University who focuses on researching state-of-the-art machine learning models and rapidly prototyping ML technologies and solutions across the scope of customer problems. He has an interest in open source projects and research in natural language processing, geometric deep learning, reinforcement learning, and computer vision. Andrew is the author and creator of AdaptNLP.