Deploying advanced machine learning technology to serve customers or business needs requires a rigorous approach and production-ready systems. This is especially true for maintaining and improving model performance over the lifetime of a production application. Unfortunately, the issues involved and the approaches available are often poorly understood.
Using an ML model in a product, service, or business process means that the model results need to be repeatable and predictable. This includes ensuring a consistent level of accuracy across different subsets of your data. For example, predicting network performance at 3 AM needs to be as accurate as predicting at 3 PM, and predicting for Nebraska needs to be as accurate as predicting for California. In addition, as you accumulate new data and train new versions of your model, you need to make sure that your model is actually improving.
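Checking accuracy across data subsets boils down to slicing your evaluation set by a feature and computing the metric per slice. Here is a minimal sketch of that idea (the function and field names are illustrative, not from any particular library; production systems typically use tooling such as TensorFlow Model Analysis for this):

```python
# A toy per-slice accuracy check: group examples by a slicing feature
# and compute accuracy separately for each group.
from collections import defaultdict

def slice_accuracy(examples, slice_key):
    """Compute accuracy separately for each value of a slicing feature."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        key = ex[slice_key]
        total[key] += 1
        if ex["prediction"] == ex["label"]:
            correct[key] += 1
    return {k: correct[k] / total[k] for k in total}

examples = [
    {"state": "NE", "prediction": 1, "label": 1},
    {"state": "NE", "prediction": 0, "label": 1},
    {"state": "CA", "prediction": 1, "label": 1},
    {"state": "CA", "prediction": 0, "label": 0},
]
print(slice_accuracy(examples, "state"))  # {'NE': 0.5, 'CA': 1.0}
```

A gap between slices, like the one above, is exactly the kind of problem that aggregate accuracy alone would hide.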
As you operate an ML model in a production environment you will accumulate artifacts such as versions of your model and dataset, and metrics and statistics for each version. Keeping a lineage graph of these artifacts becomes important for governance, for the ability to restore previous versions when necessary, and for studying the evolution of your data and model.
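Conceptually, a lineage graph just records which artifacts were derived from which others, so you can walk back from any result to everything upstream of it. The sketch below shows the idea with a toy in-memory class (the names are hypothetical; real systems such as ML Metadata persist this graph in a database):

```python
# A toy artifact lineage graph: each artifact records its parents,
# and lineage() walks back through them to list everything upstream.
class Artifact:
    def __init__(self, name, version, parents=()):
        self.name = name
        self.version = version
        self.parents = list(parents)  # artifacts this one was derived from

    def lineage(self):
        """Return this artifact plus every upstream artifact, deduplicated."""
        seen = []
        stack = [self]
        while stack:
            node = stack.pop()
            label = f"{node.name}:v{node.version}"
            if label not in seen:
                seen.append(label)
                stack.extend(node.parents)
        return seen

data_v1 = Artifact("dataset", 1)
model_v1 = Artifact("model", 1, parents=[data_v1])
metrics_v1 = Artifact("eval_metrics", 1, parents=[model_v1, data_v1])
print(metrics_v1.lineage())  # ['eval_metrics:v1', 'dataset:v1', 'model:v1']
```

Given such a graph, answering "which dataset produced these metrics?" is a traversal rather than an archaeology project.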
Large models make rigorous engineering and scalable architectures even more important. Just the size of the models themselves, and the datasets used for training, require highly efficient infrastructure. More complex pipeline topologies which include transfer learning from pre-trained foundation models, prompting, fine-tuning, instruction tuning, chaining, and complex task-specific evaluation, require a high degree of flexibility for customization.
This further complicates the rigorous analysis of model performance at a deep level, including edge and corner cases, which is a key requirement of mission-critical applications. Measuring and understanding model sensitivity is also part of any rigorous model development process.
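One simple way to think about model sensitivity is as a finite-difference question: if an input moves a little, how much does the output move? The following is a small sketch of that idea with a toy stand-in model (the model and epsilon are illustrative only; real sensitivity analysis covers many inputs and perturbation strategies):

```python
# Finite-difference sensitivity of a toy model to one input.
def model(x):
    # Stand-in for a trained model's prediction function.
    return 3.0 * x + 1.0

def sensitivity(f, x, eps=1e-6):
    """Approximate df/dx at x: output change per unit input change."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

print(sensitivity(model, 2.0))  # approximately 3.0
```

Sweeping such a measurement across the input space is one way to find the edge and corner cases where a model's behavior changes sharply.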
To meet these needs, ML pipeline architectures are the current state of the art for implementing production ML applications. Google uses TFX for large-scale ML applications and offers an open-source version to the ML community, which is actively extending TFX to add new features and components. Uber uses a similar framework called Michelangelo, and others have developed similar frameworks as well. One key requirement is the ability to distribute processing across a cluster of compute nodes, which TFX achieves by using Apache Beam. This is necessary because of the large amounts of data that are processed – sometimes literally petabytes – and the large compute requirements of increasingly large models.
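The distribution pattern itself is simple to sketch: shard the data, process the shards in parallel, and combine the results. The toy below uses only the Python standard library on one machine, whereas Apache Beam applies the same shape across a cluster of machines (the function names are illustrative):

```python
# Shard-process-combine, sketched with a local thread pool; Beam applies
# the same pattern across many machines.
from concurrent.futures import ThreadPoolExecutor

def process_shard(shard):
    # Stand-in for per-record feature engineering or statistics.
    return sum(x * x for x in shard)

def run_pipeline(records, num_shards=4):
    shards = [records[i::num_shards] for i in range(num_shards)]
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        return sum(pool.map(process_shard, shards))

print(run_pipeline(range(1000)))  # same result as the unsharded computation
```

Because each shard is processed independently, scaling to petabytes is a matter of adding workers rather than rewriting the pipeline logic.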
Increasingly sophisticated techniques for structuring model queries to improve results, known as “prompting”, have been added to the arsenal of techniques available for adapting pre-trained models to specific tasks. When a well-designed prompt can deliver adequate results, there is no additional training cost required to adapt the model.
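In its simplest form, prompting for task adaptation is templating: embed a few demonstrations of the task in the text sent to the model. A minimal sketch, where the template text is illustrative and `llm_generate` is a hypothetical model-call function, not a real API:

```python
# A few-shot prompt template: the examples teach the task in-context,
# with no training step required.
FEW_SHOT_TEMPLATE = """Classify the sentiment of each review as positive or negative.

Review: "Great battery life." Sentiment: positive
Review: "Screen cracked in a week." Sentiment: negative
Review: "{review}" Sentiment:"""

def build_prompt(review):
    return FEW_SHOT_TEMPLATE.format(review=review)

prompt = build_prompt("Fast shipping and works as advertised.")
# response = llm_generate(prompt)  # hypothetical call to a pre-trained model
print(prompt)
```

Note that the prompt itself now behaves like a versioned artifact: changing the wording or the examples can change model behavior just as retraining would.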
While pre-training models from scratch is typically only done by large companies or institutions, even fine-tuning a pre-trained model requires substantial resources. This has led to the development of various techniques for parameter-efficient fine-tuning (PEFT), which is increasingly used to adapt pre-trained models for specific tasks; a widely used example is Low-Rank Adaptation (LoRA). Sometimes, however, full fine-tuning of all of the model’s parameters is necessary to achieve the required results, though it can be expensive.
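The savings from LoRA come from simple arithmetic: instead of updating a full d × k weight matrix, LoRA freezes it and trains a low-rank update B·A, with B of shape (d, r) and A of shape (r, k) for a small rank r. A back-of-envelope sketch (the dimensions below are illustrative, not from any specific model):

```python
# Trainable-parameter counts for one weight matrix:
# full fine-tuning vs. a LoRA low-rank update.
def full_finetune_params(d, k):
    return d * k  # every entry of the d x k matrix is trainable

def lora_params(d, k, r):
    # LoRA trains B (d x r) and A (r x k); the original matrix stays frozen.
    return d * r + r * k

d, k, r = 4096, 4096, 8
print(full_finetune_params(d, k))  # 16777216
print(lora_params(d, k, r))        # 65536, a 256x reduction
```

Summed over all of a large model's weight matrices, this is the difference between needing a GPU cluster and fitting the fine-tuning job on far more modest hardware.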
All of these techniques have introduced new artifacts – prompts, PEFT weights, fine-tuning datasets, etc. – to the set of items that must be captured, managed, and tracked. They have also introduced new pipeline topologies and new client requirements. As if that weren’t enough, it is all evolving very quickly, with new techniques being announced nearly every day.
About the author/ODSC Europe 2023 speaker:
A data scientist and ML enthusiast, Robert Crowe has a passion for helping developers quickly learn what they need to be productive. Robert is currently the Senior Product Manager for TensorFlow Open-Source and MLOps at Google and helps ML teams meet the challenges of creating products and services with ML. Previously Robert led software engineering teams for both large and small companies, always focusing on moving fast to implement clean, elegant solutions to well-defined needs. You can find him on LinkedIn at robert-crowe.