Transformer models have taken the world of natural language processing (NLP) by storm. They went from beating all the research benchmarks to being adopted for production by a growing number of companies in record time. Some of the applications of these models include text classification, information extraction, text generation, machine translation, and summarization.
However, given the complexity in the underlying architecture, these transformer models are still hard to train and deploy at scale. Training can take days and the process of fine-tuning critical parameters is involved and complex. Transformer models also need highly scalable and available environments for inference and deployment.
Today we are sharing how the ONNX Runtime team and Hugging Face are working together to address and reduce these challenges in training and deployment of Transformer models. The result is a solution that simplifies training and reduces costs for inferencing.
Making NLP more accessible
Hugging Face is a company creating open-source libraries for powerful yet easy-to-use NLP, such as tokenizers and transformers. The Hugging Face Transformers library provides general purpose architectures, like BERT, GPT-2, RoBERTa, XLM, DistilBERT, XLNet, and T5, for Natural Language Understanding (NLU) and Natural Language Generation (NLG). It currently includes thousands of pretrained models in 100+ languages. These models are easy to use, powerful, and performant for many NLP tasks. Model training, evaluation, and sharing can be achieved through a few lines of code. The library also enables deep interoperability between PyTorch and TensorFlow and the flexibility to select the right framework for training, evaluation, and deployment.
ONNX Runtime helps accelerate PyTorch and TensorFlow models in production, on CPU or GPU. As an open source library built for performance and broad platform support, ONNX Runtime is used in products and services handling over 20 billion inferences each day. ONNX Runtime has optimizations for transformer models with up to 17x speedup. These improvements in latency, throughput, and costs make deploying transformer models more practical.
You can now use ONNX Runtime and Hugging Face Transformers together to improve the experience of training and deploying NLP models. Hugging Face has made it easy to run inference on Transformer models with ONNX Runtime through the new convert_graph_to_onnx.py script, which generates a model that can be loaded by ONNX Runtime.
Higher performance NLP inference
Inference performance depends on the hardware you run on, the batch size (number of inputs to process at once), and the sequence length (size of the input). If you have access to a GPU, inferencing will be faster than on a CPU. While larger batch sizes are useful during training and offline processing, we typically use a batch size of 1 for online inferencing. Sequence lengths vary by scenario: shorter lengths are used for processing queries, while Q&A and summarization scenarios use longer sequence lengths.
We measured the latency of three Hugging Face Transformer models using several batch sizes and sequence lengths on the same CPU and GPU configurations. CPU performance measurement was done on a desktop machine with an Intel® Xeon® E5-2620 v2 processor containing 12 logical cores. For GPU, we used one NVIDIA V100-PCIE-16GB GPU on an Azure Standard_NC12s_v3 VM and tested both FP32 and FP16. We used an updated version of the Hugging Face benchmarking script to run the tests. For PyTorch, we used PyTorch 1.5 with TorchScript. For PyTorch + ONNX Runtime, we exported Hugging Face PyTorch models and inferenced with ONNX Runtime 1.3.
On a GPU in FP16 configuration, compared with PyTorch, PyTorch + ONNX Runtime showed performance gains up to 5.0x for BERT, up to 4.7x for RoBERTa, and up to 4.4x for GPT-2. We saw smaller, but still significant, speedups for GPU/FP32 and CPU configurations.
Smaller sequence lengths generally showed more gains than larger sequence lengths on GPU. Our detailed data is shared at the end of this post.
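To make the measurement setup concrete, here is a minimal sketch of a latency sweep over batch sizes and sequence lengths. It is not the Hugging Face benchmarking script used for the numbers above; the names measure_latency_ms, speedup, and dummy_model are hypothetical, and dummy_model is a stand-in workload that you would replace with a real PyTorch or ONNX Runtime call.

```python
import statistics
import time
from typing import Callable, Dict, Iterable, Tuple


def measure_latency_ms(
    run_once: Callable[[int, int], object],
    batch_sizes: Iterable[int],
    sequence_lengths: Iterable[int],
    warmup: int = 2,
    repeats: int = 10,
) -> Dict[Tuple[int, int], float]:
    """Return the median latency in milliseconds per (batch_size, seq_len)."""
    sequence_lengths = list(sequence_lengths)
    results = {}
    for batch_size in batch_sizes:
        for seq_len in sequence_lengths:
            # Warm-up runs are excluded from timing (caches, lazy init).
            for _ in range(warmup):
                run_once(batch_size, seq_len)
            timings = []
            for _ in range(repeats):
                start = time.perf_counter()
                run_once(batch_size, seq_len)
                timings.append((time.perf_counter() - start) * 1000.0)
            results[(batch_size, seq_len)] = statistics.median(timings)
    return results


def speedup(baseline_ms: float, optimized_ms: float) -> float:
    """Speedup factor: how many times faster the optimized run is."""
    return baseline_ms / optimized_ms


def dummy_model(batch_size: int, seq_len: int) -> int:
    # Hypothetical stand-in workload whose cost grows with the input size.
    return sum(range(batch_size * seq_len * 100))


latencies = measure_latency_ms(dummy_model, batch_sizes=[1, 4], sequence_lengths=[8, 128])
for (batch_size, seq_len), ms in sorted(latencies.items()):
    print(f"batch={batch_size} seq_len={seq_len} median={ms:.3f} ms")
```

Running the same sweep once for the baseline and once for the optimized model, then passing each pair of medians to speedup, gives a per-configuration comparison in the same "up to Nx" form as the results above.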
We’d like to show how you can incorporate inferencing of Hugging Face Transformer models with ONNX Runtime into your projects. You can also do benchmarking on your own hardware and models.
The steps are:
1. Export your Hugging Face Transformer model to ONNX
Run the conversion script located at transformers/convert_graph_to_onnx.py. This script takes a few arguments, such as the model to be exported and the framework you want to export from (PyTorch or TensorFlow).
python convert_graph_to_onnx.py --framework pt --model bert-base-cased bert-base-cased.onnx
2. Apply latest ONNX Runtime Optimizations
ONNX Runtime automatically applies most optimizations while loading the model. Some of the latest optimizations that are not yet integrated into ONNX Runtime are available as a script that tunes models for the best performance.
You can access the optimization script and run it on your model with these commands:
pip install onnxruntime-tools
python -m onnxruntime_tools.optimizer_cli --input bert-base-cased.onnx --output bert-base-cased.onnx --model_type bert
The --model_type parameter triggers specific optimization strategies. The script also provides a --float16 flag to leverage mixed precision performance gains from newer GPUs. You should also use this script if you are using the TensorFlow version of the models. The options and usage are further described in the ONNX Runtime repository.
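If you want to script steps 1 and 2 together, the two commands can be chained from Python. This is a minimal sketch under a few assumptions: the function names export_command, optimize_command, and run_pipeline are hypothetical, the script path is passed in by the caller, and only the flags shown in the commands above are used.

```python
import subprocess
import sys
from typing import List


def export_command(script_path: str, model: str, framework: str, output: str) -> List[str]:
    """Build the step 1 command: export a model to ONNX."""
    return [
        sys.executable, script_path,
        "--framework", framework,  # "pt" for PyTorch, as in the command above
        "--model", model,
        output,
    ]


def optimize_command(input_path: str, output_path: str, model_type: str,
                     float16: bool = False) -> List[str]:
    """Build the step 2 command: run the ONNX Runtime optimizer script."""
    cmd = [
        sys.executable, "-m", "onnxruntime_tools.optimizer_cli",
        "--input", input_path,
        "--output", output_path,
        "--model_type", model_type,
    ]
    if float16:
        # Optional mixed precision conversion for newer GPUs.
        cmd.append("--float16")
    return cmd


def run_pipeline(script_path: str, model: str, onnx_path: str, model_type: str,
                 float16: bool = False) -> None:
    """Run export, then optimization; raises if either command fails."""
    commands = [
        export_command(script_path, model, "pt", onnx_path),
        optimize_command(onnx_path, onnx_path, model_type, float16),
    ]
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

For example, run_pipeline("convert_graph_to_onnx.py", "bert-base-cased", "bert-base-cased.onnx", "bert") would issue the same two commands shown above, assuming the script and the onnxruntime-tools package are available in the current environment.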
3. Inference with ONNX Runtime
ONNX Runtime is written in C++ for performance and provides APIs/bindings for Python, C, C++, C#, and Java. It’s a lightweight library that lets you integrate inference into applications written in a variety of languages.
Below is what the code looks like in Python. In Python, we use the tokenizer from the Hugging Face library. In other languages you may need to implement your own tokenizer to process the string input and turn it into tensors the model expects as inputs.
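A minimal sketch of this step, assuming the onnxruntime and transformers packages are installed and that the model was exported as in step 1. The helper names run_inference and load_and_run are hypothetical, and the choice of the CPU execution provider is an assumption you may want to change for GPU runs.

```python
from typing import Any, Callable, List


def run_inference(session: Any, tokenizer: Callable[..., Any], text: str) -> List[Any]:
    """Tokenize `text` and run it through an ONNX Runtime session."""
    # return_tensors="np" asks the tokenizer for NumPy arrays with a batch dimension.
    tokens = tokenizer(text, return_tensors="np")
    # Feed only the inputs the exported graph declares, since some models
    # do not take every field the tokenizer produces.
    feeds = {inp.name: tokens[inp.name] for inp in session.get_inputs()}
    # Passing None as the output list returns every model output.
    return session.run(None, feeds)


def load_and_run(model_path: str, model_name: str, text: str) -> List[Any]:
    """Load the ONNX model and its matching tokenizer, then run one input."""
    # Imported lazily so run_inference stays usable without these packages.
    from onnxruntime import InferenceSession
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    session = InferenceSession(model_path, providers=["CPUExecutionProvider"])
    return run_inference(session, tokenizer, text)
```

Calling load_and_run("bert-base-cased.onnx", "bert-base-cased", "This is a sample input.") would return the model outputs as a list of NumPy arrays. The tokenizer name must match the model that was exported.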
You can find a notebook showing all the steps in the Hugging Face GitHub repo in the link below.
We hope this has inspired you to try out Hugging Face Transformer models with ONNX Runtime. We’d love to hear about your experiences in the comments. The ONNX Runtime team is continually improving performance, so keep an eye out for even more improvements on more models. You can also participate in our GitHub repos (Hugging Face Transformers library and ONNX Runtime).
In future blogs we’ll discuss more optimizations, including how you can use quantization to reduce the size of your model and improve performance on newer hardware. Stay tuned!
For the GPT-2 test, we disabled past state input/output. Enabling past state can help reduce computation by reusing intermediate results. Past state optimizations are being added to ONNX Runtime, which will further improve performance when using large sequence sizes.
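To illustrate why reusing past state reduces computation, here is a simplified count of how many token positions a model has to process during generation. It is a back-of-the-envelope sketch, not a description of the ONNX Runtime implementation: it assumes that without past state each step re-processes the whole sequence so far, and that with past state only the newest token is processed after the first step. The function names are hypothetical.

```python
def tokens_processed_without_past(prompt_len: int, new_tokens: int) -> int:
    """Each generation step re-runs the model over the full sequence so far."""
    return sum(prompt_len + step for step in range(new_tokens))


def tokens_processed_with_past(prompt_len: int, new_tokens: int) -> int:
    """The first step processes the prompt; later steps process one new token."""
    if new_tokens == 0:
        return 0
    return prompt_len + (new_tokens - 1)


# Example: a prompt of 8 tokens followed by 4 generated tokens.
print(tokens_processed_without_past(8, 4))  # 8 + 9 + 10 + 11 = 38
print(tokens_processed_with_past(8, 4))     # 8 + 3 = 11
```

The gap grows with the number of generated tokens and with the sequence length, which is consistent with the expectation that past state helps most for large sequence sizes.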