Editor’s note: Kabir Nagrecha is a speaker for ODSC West 2023 this Fall. Be sure to check out his talk, “Democratizing Fine-tuning of Open-Source Large Models with Joint Systems Optimization,” there!
Model scale has become an absolutely essential aspect of modern deep learning practice. The success of multi-billion parameter large language models (LLMs) has encouraged practitioners to push the limits of model size. Researchers & practitioners are now building increasingly large and memory-hungry model architectures to boost accuracy and model quality.
But these scales introduce infrastructural challenges that can’t be ignored. The resource demands of model training have shot through the roof. And if there’s one thing we’ve learned from our collaborations at UCSD and with our industry partners, it’s that deep learning jobs are never run in isolation. If you’re training one model, you’re probably training a dozen — hyperparameter optimization, multi-user clusters, & iterative exploration all motivate multi-model training, blowing up compute demands further still.
In this blog post, we’ll outline our work on Saturn, our recently open-sourced tool for training multiple large models simultaneously. Saturn integrates seamlessly with industry-standard tools for large model training such as HuggingFace, PyTorch’s FSDP, Google’s GPipe, FairScale, XFormers, and more. We’ll demonstrate how Saturn can help optimize execution in this new & critical setting, reducing costs & boosting training efficiency.
Why train large models? And why train many of them?
“Increasing the size of our models has definitely boosted our accuracy. It’s become a general rule — larger is better.” (Saturn user in industry)
Recent studies have demonstrated that larger models tend to perform better on a variety of benchmarks.
So, if you’re looking to build the most accurate model to power your application, you’re probably going to use a large-scale model architecture — maybe a LLaMA model, maybe GPT-J, maybe even GPT-2.
It seems clear that there’s ample motivation to train large models. But the core idea behind Saturn is to optimize multiple large models at once.
“We probably go through a hundred…maybe a thousand different model iterations before we push anything to production. And this process repeats daily for re-training.” (Saturn user on their production pipelines)
Deep learning in practice almost always consists of multiple training jobs. Industry clusters receive jobs from hundreds of users & pipelines. Automated production pipelines might trigger dozens of training jobs with different configurations (e.g. learning rates, batch sizes, model architecture variations) to find the most accurate one to deploy. Individual researchers might submit multiple exploratory jobs to evaluate different approaches. By the end of the day, a cluster might have received thousands of jobs to manage.
There’s an unsolved problem at the intersection of large-model training and multi-model execution. With Saturn, we explicitly optimize for this setting.
What is Saturn? And how can I use it?
When tackling multiple large models simultaneously, there are three critical problems to consider.
First, parallelization. Large models typically need multiple GPUs for memory distribution and higher training throughput. Designing the parallel execution scheme is a challenge, though: there are dozens (or even hundreds!) of different techniques. FSDP, pipeline parallelism, 3D parallelism, and more: the best one to choose will depend on your GPUs, interconnects, model architecture, and even hyperparameters. It’s a difficult space to navigate, and choosing the wrong approach can have serious performance implications.
Second, resource apportioning. Given a cluster with 100 GPUs, and 30 submitted jobs, how should the GPUs be distributed over the jobs? Optimizing throughput requires automatically finding an optimized resource distribution plan.
Third, scheduling. How should execution be ordered to minimize end-to-end running times? Should Model A be run before Model B? Or Model C before Model A?
Notice that these problems are all connected. The resource apportionment plan you use will constrain the schedule and also influence optimal parallelism selections, and vice versa. So we have to solve this as a joint problem.
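To make the coupling concrete, here is a minimal sketch of how the best parallelization technique can flip as the GPU allocation changes. The runtime table is invented purely for illustration, and `best_technique` is a hypothetical helper, not part of Saturn.

```python
# Hypothetical profiled runtimes in minutes: PROFILE[technique][gpu_count].
# These numbers are made up for illustration only.
PROFILE = {
    "fsdp": {1: 120.0, 2: 70.0, 4: 45.0},
    "pipeline": {2: 60.0, 4: 50.0},  # assumed unable to run on a single GPU
}


def best_technique(profile, gpus):
    """Return (technique, runtime) with the lowest runtime at this GPU count."""
    candidates = [
        (runtimes[gpus], name)
        for name, runtimes in profile.items()
        if gpus in runtimes
    ]
    if not candidates:
        raise ValueError(f"no technique profiled for {gpus} GPUs")
    runtime, name = min(candidates)
    return name, runtime


for gpus in (1, 2, 4):
    # The winning technique depends on how many GPUs the job was given.
    print(gpus, best_technique(PROFILE, gpus))
```

With these numbers, the preferred technique switches between the two options as the job moves from 1 to 2 to 4 GPUs, so the apportioning decision cannot be made independently of the parallelism decision.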
And that’s exactly what Saturn does. With Saturn, you can just submit a batch of training jobs, and watch as a plan solving all of these problems is auto-generated. Saturn runs a quick profiling scan over all your jobs, then uses a mixed-integer linear programming solver to produce an optimized plan for execution. It’s easy to extend and register new parallelization techniques, so there’s no risk of falling behind as research advances. Combined with some mechanisms such as introspection for efficient scheduling, it’s able to dramatically reduce running times of batched multi-model jobs. We find that in practice, it’s able to reduce execution times and costs by as much as 2X versus current practice, all while reducing developer workloads.
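As a rough illustration of the "profile, then plan" idea, the sketch below searches jointly over GPU allocations and launch orders for three jobs. It is a toy: it uses exhaustive search in place of the mixed-integer linear programming solver, and the runtime table, `makespan`, and `plan` are all hypothetical names and numbers rather than Saturn's implementation.

```python
from itertools import permutations, product

# Hypothetical profiled runtimes (minutes) per job and GPU count, assuming the
# best technique for each GPU count was already chosen during profiling.
RUNTIMES = {
    "A": {1: 100.0, 2: 55.0, 4: 30.0},
    "B": {1: 80.0, 2: 45.0, 4: 28.0},
    "C": {1: 40.0, 2: 25.0, 4: 20.0},
}
TOTAL_GPUS = 4


def makespan(order, gpus_for, runtimes, total_gpus):
    """Simulate launching jobs in `order`; a job starts once enough GPUs are free."""
    running = []  # (finish_time, gpus_held) for jobs currently executing
    now, free, end = 0.0, total_gpus, 0.0
    for job in order:
        need = gpus_for[job]
        while free < need:
            # Advance time to the next job completion and reclaim its GPUs.
            running.sort()
            finish, held = running.pop(0)
            now = max(now, finish)
            free += held
        finish = now + runtimes[job][need]
        running.append((finish, need))
        free -= need
        end = max(end, finish)
    return end


def plan(runtimes, total_gpus):
    """Brute-force the (makespan, order, allocation) with the smallest makespan."""
    jobs = sorted(runtimes)
    best = None
    for counts in product(*(sorted(runtimes[j]) for j in jobs)):
        if max(counts) > total_gpus:
            continue  # this allocation could never fit on the cluster
        gpus_for = dict(zip(jobs, counts))
        for order in permutations(jobs):
            span = makespan(order, gpus_for, runtimes, total_gpus)
            if best is None or span < best[0]:
                best = (span, order, gpus_for)
    return best


span, order, gpus_for = plan(RUNTIMES, TOTAL_GPUS)
baseline = makespan(("A", "B", "C"), {"A": 4, "B": 4, "C": 4}, RUNTIMES, TOTAL_GPUS)
print("joint plan:", span, order, gpus_for)
print("one job at a time on all GPUs:", baseline)
```

On these invented numbers, the joint search finishes in 70 minutes by giving each job 2 GPUs and overlapping them, versus 78 minutes when each job runs alone on all 4 GPUs. Exhaustive search only works for a handful of jobs, which is why a solver-based formulation matters at cluster scale.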
Using Saturn to run your jobs more efficiently is a simple process. Our GitHub repository contains an example workflow, but here’s a short walkthrough.
First, wrap your model loading/initialization function in Saturn’s “Task” construct.
    from saturn import Task

    t1 = Task(load_model, load_dataloader, loss_function, hyperparameters)
Next, register any parallelization techniques you want to use with Saturn’s “Library.”
    from saturn.library import register
    from saturn.core.executors.Technique import BaseTechnique

    class MyExecutor(BaseTechnique):
        def execute(task, gpus, task_id, batch_count):
            # train the model
            return

        def search(task, gpus, task_id):
            # optimize any internal parameters
            # these are automatically associated with the submitted task
            return parameters

    register("my_parallelism", MyExecutor)
Finally, submit a list of tasks for profiling and execution.
    from saturn.trial_runner.PerformanceEvaluator import search
    from saturn import orchestrate

    search([t1, t2, t3, t4])  # profiles and evaluates each task
    orchestrate([t1, t2, t3, t4])  # orchestrates and manages execution
Conclusions & Takeaways
Saturn enables dramatic speedups of up to 2X for workloads that train multiple large models simultaneously. It’s compatible with standard tooling such as HuggingFace. The next time you’re trying to fine-tune a LLaMA model, or build your own LLM, consider using Saturn to reduce costs & accelerate runtimes! If you’re interested in learning more, you can find our paper on Saturn here.
We’re currently looking to onboard new contributors and users, so if you’re interested in Saturn, consider checking out our recently open-sourced repository on GitHub at https://github.com/knagrecha/saturn!
About the Author/ODSC West 2023 Speaker:
Kabir Nagrecha is a Ph.D. candidate at UC San Diego, working with Professors Arun Kumar & Hao Zhang. He is the recipient of the Meta Research Fellowship, as well as fellowships from the Halicioglu Data Science Institute and Jacobs School of Engineering at UCSD.
Kabir is the youngest-ever Ph.D. student at UCSD, having started his doctorate at the age of 17. He’s previously worked with companies such as Apple, Meta, & Netflix to build the core infrastructure that supports widely-used services such as Siri & Netflix’s recommendation algorithms.