Editor’s note: Jennifer Davis is a speaker for ODSC West 2022 coming this November. Be sure to check out her talk, “Large Scale Deep Learning using the High-Performance Computing Library OpenMPI and DeepSpeed,” there!
This article covers the use of the message passing interface (MPI) and Microsoft’s DeepSpeed to execute deep learning with parameter-heavy workloads. DeepSpeed (which uses the message-passing interface) is optimized for models of upwards of a trillion parameters. We will review briefly how MPI works to speed up code and go over a canonical example of calculating Pi. Afterward, we will discuss applications of DeepSpeed and its ZeRO algorithms. The article is high-level material geared towards data scientists.
High-performance computing is necessary for many industries because data volumes have grown exponentially. At the same time, the turnaround time expected for machine learning and deep learning analyses and products remains tight in today’s competitive markets.
DeepSpeed running with MPI is helping data scientists build smarter and deploy faster. These technologies apply to a wide range of industries and use cases from distributed training of deep learning models to large-scale simulations. Every data scientist should have a basic understanding of high-performance computing methodology and implementations.
What is OpenMPI, and how is it connected to High-Performance Computing?
Distributed systems are complex. Handling communication, memory, and message exchange, and tracking which operations are taking place within a distributed system, can be difficult. MPI manages these tasks for the user. Using the message passing interface (MPI) via the open source library OpenMPI accomplishes this with minimal effort and enables model parallelism across processes, whether they share one machine or are spread over several.
Model parallelism works best for running computations or models across multiple machines (that is, multiple CPUs or GPUs). To accomplish it, both the data and the model must be partitioned correctly, through correctly written code and proper library usage. OpenMPI manages the communication, synchronization, and data transfer among machines in this paradigm. Performing parallelized machine learning computations using OpenMPI requires that the data exist in batches spread across machines and that parameter updates be centralized on a parameter server. The copy of the machine learning model installed on each machine then trains or infers on its own batches. The parameter gradients go to the server, which updates the current values of the model parameters. This method of training over sharded datasets is illustrated in (1).
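The parameter-server loop described above can be sketched in a few lines of plain Python. This is a single-process simulation under simplifying assumptions, not the setup from (1): the helper names (`make_shards`, `worker_gradient`, `train`) are hypothetical, the model is a toy line fit, and the workers are synchronous, with the "server" simply averaging their gradients at each step.

```python
import random

def make_shards(num_workers, samples_per_worker, slope, intercept, seed=0):
    """Create one noise-free (x, y) shard per simulated worker."""
    rng = random.Random(seed)
    shards = []
    for _ in range(num_workers):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(samples_per_worker)]
        shards.append([(x, slope * x + intercept) for x in xs])
    return shards

def worker_gradient(params, shard):
    """Mean-squared-error gradient for y = a*x + b on one worker's shard."""
    a, b = params
    grad_a = grad_b = 0.0
    for x, y in shard:
        err = (a * x + b) - y
        grad_a += 2.0 * err * x
        grad_b += 2.0 * err
    n = len(shard)
    return grad_a / n, grad_b / n

def train(shards, lr=0.2, steps=500):
    """Simulated parameter server: gather worker gradients, average, update."""
    params = [0.0, 0.0]  # current parameter values held by the "server"
    for _ in range(steps):
        # Each worker computes a gradient on its own shard only.
        grads = [worker_gradient(params, shard) for shard in shards]
        # The server averages the gradients and updates the shared parameters.
        for i in range(len(params)):
            mean_grad = sum(g[i] for g in grads) / len(grads)
            params[i] -= lr * mean_grad
    return params

true_slope, true_intercept = 2.0, -1.0
shards = make_shards(4, 100, true_slope, true_intercept)
slope, intercept = train(shards)
print(round(slope, 4), round(intercept, 4))
```

In a real deployment each call to `worker_gradient` would run on a different machine, and the gather and update steps would be MPI communication rather than a list comprehension.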
Below is a code snippet of a canonical example: calculating Pi using a Monte Carlo technique. See the Wikipedia article Monte Carlo Method for an explanation of the Monte Carlo estimation of Pi (3). The total time to calculate Pi on a single GPU instance with 10^8 points is about 1 minute.
    import math
    import random

    def sample(num_samples):
        """Count uniformly random points in [-1, 1]^2 that land inside the unit circle."""
        num_inside = 0
        for _ in range(num_samples):
            x, y = random.uniform(-1, 1), random.uniform(-1, 1)
            if math.hypot(x, y) <= 1:
                num_inside += 1
        return num_inside

    num_samples = 10**8  # defined at module level so the estimate below can use it
    num_inside = sample(num_samples)
    pi = (4 * num_inside) / num_samples
    print(pi)

Wall time: 59.8 s, GPU AWS instance: g4dn-16xlarge, 32 CPU cores, 256 GB memory per GPU, 1 instance
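The estimator in the snippet above rests on a ratio of areas. As a short sketch of the reasoning (the symbols N, N_in, and \hat{\pi} are introduced here for illustration and do not appear in the original code):

```latex
\[
P\bigl(x^2 + y^2 \le 1\bigr)
  = \frac{\text{area of unit circle}}{\text{area of square } [-1,1]^2}
  = \frac{\pi}{4},
\qquad
\hat{\pi} = \frac{4\,N_{\text{in}}}{N}.
\]
\[
\operatorname{sd}(\hat{\pi}) = \sqrt{\frac{\pi\,(4-\pi)}{N}} \approx \frac{1.64}{\sqrt{N}}
\quad\Longrightarrow\quad
N = 10^{8}:\ \operatorname{sd}(\hat{\pi}) \approx 1.6 \times 10^{-4}.
\]
```

With 10^8 samples, the estimate is therefore typically accurate to about three decimal places.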
In our second example, the library mpi4py serves as a Python API to OpenMPI. We apply the concept of a shared calculation of Pi. Notice that there is a separate file, cpi.py, in which the work of the Pi calculation is distributed among the workers. The workers accomplish the calculation together. The increase in speed is significant: about 6x faster than calculating Pi without MPI (9.79 s versus 59.8 s). Using OpenMPI with six GPUs (AWS instance: g4dn-16xlarge, 32 CPU cores, 256 GB memory) takes under 10 seconds.
    from mpi4py import MPI
    import numpy
    import sys

    # Spawn six worker processes, each running cpi.py
    comm = MPI.COMM_SELF.Spawn(sys.executable, args=['cpi.py'], maxprocs=6)

    # Broadcast the number of intervals to every worker
    N = numpy.array(10**8, 'i')
    comm.Bcast([N, MPI.INT], root=MPI.ROOT)

    # Sum the workers' partial results into PI on the parent
    PI = numpy.array(0.0, 'd')
    comm.Reduce(None, [PI, MPI.DOUBLE], op=MPI.SUM, root=MPI.ROOT)
    print(PI)

    comm.Disconnect()

Wall time: 9.79 s, AWS instance: g4dn-16xlarge, 32 CPU cores, 256 GB memory per GPU, 6 instances
Contents of the cpi.py file are shown below:
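    #!/usr/bin/env python
    from mpi4py import MPI
    import numpy

    # Connect to the parent process that spawned this worker
    comm = MPI.Comm.Get_parent()
    size = comm.Get_size()
    rank = comm.Get_rank()
    name = MPI.Get_processor_name()

    # Receive the number of intervals from the parent
    N = numpy.array(0, dtype='i')
    comm.Bcast([N, MPI.INT], root=0)

    # Each worker sums every size-th interval, starting at its own rank
    h = 1.0 / N
    s = 0.0
    for i in range(rank, N, size):
        x = h * (i + 0.5)
        s += 4.0 / (1.0 + x**2)

    # Send this worker's partial result to the parent for summation
    PI = numpy.array(s * h, dtype='d')
    comm.Reduce([PI, MPI.DOUBLE], None, op=MPI.SUM, root=0)

    comm.Disconnect()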
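Note that cpi.py does not use Monte Carlo sampling: it approximates Pi as the integral of 4 / (1 + x^2) over [0, 1] using the midpoint rule, with each worker summing every `size`-th interval starting at its own `rank`. The sketch below emulates that partitioning in a single process, without MPI, to show that the strided partial sums cover each interval exactly once and add up to Pi. The function name `partial_pi` and the smaller interval count are assumptions made for illustration.

```python
import math

def partial_pi(rank, size, n):
    """Midpoint-rule partial sum over intervals rank, rank + size, rank + 2*size, ..."""
    h = 1.0 / n
    s = 0.0
    for i in range(rank, n, size):
        x = h * (i + 0.5)  # midpoint of the i-th interval
        s += 4.0 / (1.0 + x * x)
    return s * h

n, size = 100_000, 6  # fewer intervals than the article's 10**8, to run quickly

# Each list entry plays the role of one worker's local result.
partials = [partial_pi(rank, size, n) for rank in range(size)]

# Summing the partials plays the role of Reduce with op=MPI.SUM on the parent.
pi_estimate = sum(partials)
print(pi_estimate, abs(pi_estimate - math.pi))

# The strided index sets are disjoint and together cover every interval once.
covered = sorted(i for rank in range(size) for i in range(rank, n, size))
```

Because the workers' index sets never overlap, no communication is needed until the final reduction, which is what makes this example scale so easily.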
The above trivial example illustrates the power of OpenMPI and parallelism. This technology forms part of the backbone of a new library released by Microsoft in 2020 called DeepSpeed.
DeepSpeed and the Zero Redundancy Optimizer
The DeepSpeed library, which runs on OpenMPI, is an open source library released by Microsoft in 2020. It offers significant speed gains while maintaining the accuracy of machine learning models built with PyTorch, Hugging Face, and PyTorch Lightning. Pre-existing solutions suffered from limited device memory, memory redundancy, and communication inefficiency (2). The Zero Redundancy Optimizer (ZeRO), a memory usage optimizer, was invented by scientists at Microsoft. DeepSpeed contains its memory optimization algorithms in the versions ZeRO, ZeRO-2, and ZeRO-3. This novel optimizer eliminates memory redundancy in both sharded-data (data-parallel) and model-parallel settings. Models run with low communication volume and high computational parallelism. This approach overcomes the fundamental problems of large-scale model training and works with upwards of a trillion parameters.
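To make "eliminating memory redundancy" concrete, here is a back-of-the-envelope sketch of per-device memory for model states, following the accounting in the ZeRO paper (2). It assumes mixed-precision Adam training: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and K = 12 for the fp32 optimizer states. Activations and temporary buffers are ignored. The function name `zero_model_state_bytes` is hypothetical and is not part of the DeepSpeed API, and the stage numbers follow the paper's three cumulative partitioning stages.

```python
def zero_model_state_bytes(num_params, num_devices, stage, k=12):
    """Approximate per-device bytes for model states under ZeRO partitioning.

    stage 0: plain data parallelism, everything replicated on every device
    stage 1: optimizer states partitioned across devices
    stage 2: optimizer states and gradients partitioned
    stage 3: optimizer states, gradients, and parameters partitioned
    """
    weights = 2 * num_params   # fp16 parameters
    grads = 2 * num_params     # fp16 gradients
    optim = k * num_params     # fp32 optimizer states (Adam)
    if stage >= 1:
        optim /= num_devices
    if stage >= 2:
        grads /= num_devices
    if stage >= 3:
        weights /= num_devices
    return weights + grads + optim

# Worked example from the ZeRO paper: 7.5B parameters on 64 devices.
num_params, num_devices = 7_500_000_000, 64
per_stage_gb = {
    stage: zero_model_state_bytes(num_params, num_devices, stage) / 1e9
    for stage in range(4)
}
for stage, gb in per_stage_gb.items():
    print(f"stage {stage}: {gb:.1f} GB per device")
```

Under these assumptions, the replicated baseline needs 120 GB per device for model states, while full partitioning brings this under 2 GB, which is the redundancy that the optimizer removes.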
In our workshop at ODSC West 2022, we will leverage DeepSpeed for a simple computer vision model and a sizable language summarization model (based on the DailyMail news stories dataset). These powerful techniques are presently used in proteomics, news summarization, and other industry-specific applications.
What you will learn
This two-hour workshop will be divided into a lecture, a discussion, and a hands-on tutorial. We’ll cover the basics of machine learning with DeepSpeed and OpenMPI using the Python programming language. We will briefly discuss using OpenMPI with the R programming language, but all exercises will be in Python. Come join in and learn how you can level up your machine learning and deep learning work using the high-performance computing tool OpenMPI and the distributed deep learning tool DeepSpeed.
1. Large Scale Distributed Deep Networks. J. Dean, G. Corrado, R. Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc’Aurelio Ranzato, A. Senior, P. Tucker, Ke Yang, A. Ng. NIPS, 3 December 2012
2. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He. SC ’20: Proceedings of the International Conference for High-Performance Computing, Networking, Storage and Analysis, November 2020, Article No. 20, Pages 1–16
3. Wikipedia (English). Monte Carlo Method. Accessed Aug 1, 2022
About the Author/ODSC West 2022 Speaker:
Dr. Jennifer Davis is a Staff member of the Field Data Science Team at Domino Data Lab, where she empowers clients on complex data science and deep learning projects. Jennifer completed two postdocs in computational and systems biology, trained at a supercomputing center at the University of Texas, Austin, and worked on hundreds of consulting projects and training with companies ranging from start-ups to the Fortune 100. She previously presented topics at conferences such as the Association for Computing Machinery on LSTMs and Natural Language Generation, IEEE on Computer Vision, and ODSC-West on Reinforcement Learning, as well as conferences across the US and Europe. Jennifer was part of a panel discussion for an IEEE conference on artificial intelligence in biology and medicine in 2017. In 2022 she presented some of her work that leverages distributed computing at the IEEE Engineering in Medicine and Biology Conference.