This tutorial demonstrates a few features of PyTorch Profiler that were released in v1.9. PyTorch Profiler is a set of tools that allow you to measure the training performance and resource consumption of your PyTorch model. It helps you diagnose and fix machine learning performance issues whether you are working on one machine or many. The objective is to target the execution steps that are the most costly in time and memory and to visualize the workload distribution between GPUs and CPUs. This effort is a collaboration between Microsoft and PyTorch to help PyTorch users execute their models faster and address model performance bottlenecks. View other collaborations between Microsoft and PyTorch here.
This tutorial will run you through a batch size optimization scenario on a Resnet18 model.
The basic usage of PyTorch Profiler is introduced here. In this tutorial, we will use the same code but turn on more switches to demonstrate more advanced usage of the PyTorch Profiler on TensorBoard to analyze model performance.
To install torch, torchvision, and the Profiler plugin, use the following command:
pip install torch torchvision torch-tb-profiler
1. Prepare the data and model
Below we will grab the dataset CIFAR10 which is an open-source dataset built into torchvision. From there we will use transfer learning with the pre-trained model resnet18. Then we will train the model.
# import all the necessary libraries
import torch
import torch.nn
import torch.optim
import torch.profiler
import torch.utils.data
import torchvision.datasets
import torchvision.models
import torchvision.transforms as T

# prepare input data and transform it
transform = T.Compose(
    [T.Resize(224),
     T.ToTensor(),
     T.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
train_set = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)

# use a DataLoader to launch each batch
train_loader = torch.utils.data.DataLoader(train_set, batch_size=1, shuffle=True, num_workers=4)

# Create a ResNet model, loss function, and optimizer objects. To run on GPU, move model and loss to a GPU device
device = torch.device("cuda:0")
model = torchvision.models.resnet18(pretrained=True).cuda(device)
criterion = torch.nn.CrossEntropyLoss().cuda(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
model.train()

# define the training step for each batch of input data
def train(data):
    # data is an (inputs, labels) pair produced by the DataLoader
    inputs, labels = data[0].to(device=device), data[1].to(device=device)
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
2. Use the profiler to record execution events
Now that we have set up a basic model, let us enable the optional features in the profiler to record additional information during training. This will provide insights into improving performance. The parameters that we will include are:
- repeat in schedule – “schedule” is a callable that takes a step (int) as a single parameter and returns the profiler action to perform at each step. In this example, with repeat=2, the profiler records 2 spans, each consisting of 1 wait step, 1 warmup step, and 3 active steps. More information about wait/warmup/active can be found here.
- profile_memory - Track tensor memory allocation/deallocation. Note that setting profile_memory=True may add a few minutes to the run.
- with_stack - Record source information (file and line number) for the ops. If the TensorBoard is launched in VSCode, clicking on a stack trace will jump to the code file and line.
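To make the wait/warmup/active/repeat cycle concrete, here is a minimal pure-Python sketch of which action is taken at each step. It is a simplified model written for illustration, not the library's implementation; the function name profiler_action and the string labels are assumptions made for this example.

```python
def profiler_action(step, wait=1, warmup=1, active=3, repeat=2):
    """Simplified model of the schedule (illustrative, not the torch.profiler code)."""
    cycle_len = wait + warmup + active
    # After `repeat` full cycles the profiler stays idle.
    # In this sketch, repeat=0 is treated as "keep cycling forever".
    if repeat > 0 and step >= cycle_len * repeat:
        return "none"
    pos = step % cycle_len
    if pos < wait:
        return "none"        # skipped step
    if pos < wait + warmup:
        return "warmup"      # profiler is on, but results are discarded
    # The last active step of a cycle is where the trace is handed to on_trace_ready.
    return "record_and_save" if pos == cycle_len - 1 else "record"


# Actions for the 10 steps covered by wait=1, warmup=1, active=3, repeat=2
actions = [profiler_action(step) for step in range(10)]
for step, action in enumerate(actions):
    print(step, action)
```

With the settings used in this tutorial, steps 0 and 5 are skipped, steps 1 and 6 warm up, and the remaining six steps are recorded, which is why the training loop only needs to run for (1 + 1 + 3) * 2 = 10 steps.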
The below code will enable the profiler:
with torch.profiler.profile(
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=2),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/resnet18_batchsize1'),
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    for step, batch_data in enumerate(train_loader):
        if step >= (1 + 1 + 3) * 2:
            break
        train(batch_data)
        prof.step()  # Call this at the end of each step to notify the profiler of the step boundary.
3. Run training step with Profiler
Now that we have added the profiler code to our training step, the profiling result will be saved under the ./log directory. Passing this directory to TensorBoard on the command line lets you analyze the profiling results there.
4. Use TensorBoard to view results and analyze the performance
We have now run the model training with the profiler code added. Let’s launch TensorBoard and look at the metrics and insights we were able to log. As a side note, having both TensorBoard and the PyTorch Profiler integrated directly in VS Code lets you jump straight to the source code (file and line) from the profiler stack traces, as shown here.
Launch TensorBoard, pointing it at the ./log directory where the profiling results were saved:
tensorboard --logdir=./log
Open the TensorBoard profile URL in the Google Chrome or Microsoft Edge browser.
The PyTorch Profiler plugin is shown below:
Alternatively, to save time, you can skip the steps above and try the end-to-end example by running “tensorboard --logdir=https://torchtbprofiler.blob.core.windows.net/torchtbprofiler/demo/”.
5. Optimize performance
The best way to interpret the overview page is to start with the “GPU Summary”. From the “GPU Summary” panel above, we can see that the “GPU Utilization” is only 8.6%. That is very low: the ideal GPU Utilization is 100%, which means the GPU is busy crunching data all the time. In the “Execution Summary”, we can see that about 63% of the execution time is spent on the CPU side.
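As a rough illustration of what a utilization figure like this expresses, the sketch below computes the fraction of a time window during which at least one GPU kernel is running. It assumes the metric is simply "busy time divided by total time"; the function name gpu_utilization and the kernel intervals are hypothetical values chosen for the example, not data from the profiled run.

```python
def gpu_utilization(kernel_intervals, window_start, window_end):
    """Fraction of [window_start, window_end] in which at least one kernel runs.

    kernel_intervals is a list of (start, end) times; overlapping kernels are
    merged so that busy time is not counted twice.
    """
    busy = 0
    current_start = current_end = None
    for start, end in sorted(kernel_intervals):
        # Clip each kernel to the measurement window.
        start = max(start, window_start)
        end = min(end, window_end)
        if start >= end:
            continue
        if current_end is None or start > current_end:
            # A gap: close the previous busy span and open a new one.
            if current_end is not None:
                busy += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlap: extend the current busy span.
            current_end = max(current_end, end)
    if current_end is not None:
        busy += current_end - current_start
    return busy / (window_end - window_start)


# Hypothetical kernel timeline (arbitrary time units) inside a 10-unit window
example_intervals = [(0, 2), (1, 3), (5, 6)]
utilization = gpu_utilization(example_intervals, 0, 10)
print(f"GPU utilization: {utilization:.0%}")
```

In this made-up timeline the GPU is busy for 4 of 10 time units, so long idle gaps between kernels translate directly into a low utilization number.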
Since the GPU Utilization is so low, we can look at the “Performance Recommendation” panel at the bottom. This panel provides suggestions on how to optimize your model to increase performance, in this case GPU Utilization. In this example, the recommendation suggests that we increase the batch size. We can follow it and increase the batch size to 32.
train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True, num_workers=4)
Then change the trace handler argument so that the results are saved to a different folder:
on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/resnet18_batchsize32'),
Then run the program again. Restart TensorBoard and switch the “run” option to “resnet18_batchsize32”.
After increasing the batch size, the “GPU Utilization” increased to 51.21%, far better than the initial 8.6%. In addition, the CPU time is reduced to 27.13%. The overall time for training 32 samples is reduced to 61.8ms, compared with the previous 54.5*32=1744ms with a batch size of 1.
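The comparison above can be checked with a few lines of arithmetic. This is only a sketch using the two step times quoted in the text; the variable names are chosen for this example.

```python
# Step times reported by the profiler in the two runs (milliseconds)
step_ms_batch1 = 54.5    # one step processes 1 sample
step_ms_batch32 = 61.8   # one step processes 32 samples
samples = 32

# Time to push 32 samples through the model with batch size 1
total_ms_batch1 = step_ms_batch1 * samples

# How much faster the batch-size-32 run handles the same 32 samples
speedup = total_ms_batch1 / step_ms_batch32

# Average time per sample with batch size 32
per_sample_ms_batch32 = step_ms_batch32 / samples

print(f"batch size 1:  {total_ms_batch1:.0f} ms for {samples} samples")
print(f"batch size 32: {step_ms_batch32:.1f} ms for {samples} samples")
print(f"speedup: {speedup:.1f}x, {per_sample_ms_batch32:.2f} ms per sample")
```

On these numbers, the run with batch size 32 processes the same 32 samples roughly 28 times faster, at just under 2ms per sample.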
6. Analyze the performance
Batch size is the number of training samples processed in one iteration. It affects the optimization parameters during that iteration. Usually, it is worth tuning the batch size loaded for each iteration to balance learning quality and convergence rate.
In the run with batch size 1, the operator’s device time is shorter than its host time. This means the operator doesn’t fully utilize the GPU, because more time is spent on the CPU side doing routine work such as kernel launching rather than computation.
In the run with batch size 32, the operator’s device time is far longer than its host time. This means the operator utilizes the GPU better when we increase the batch size.
To dig further, switch to the GPU kernel view. This view shows each CUDA kernel’s time cost. For easy comparison, change “Group By” to “Kernel Name + Op Name” and search for the same operator name, “cudnn_convolution”. The run with batch size 1 is shown below.
Then change to the run with batch size 32 and search for the same operator name. It shows that the list of kernels launched by the same operator can change with batch size. The kernels’ execution time with batch size 32 is increased, and their “Mean Blocks per SM” and “Mean Est Achieved Occupancy” are also mostly increased, which indicates higher GPU utilization.
Finally, switch to “Trace view”. This view visualizes the execution timeline, both on the CPU and GPU side. In the run with batch size 1, both the “GPU Utilization” and “GPU Estimated SM Efficiency” are low.
In the run with batch size 32, both metrics are increased.
The trace view can be zoomed in to see more detailed information. The run with batch size 1 has a very “sparse” kernel timeline.
While the run with batch size 32 has a “dense” kernel timeline.
From this example, we can conclude that a bigger batch size can lead to bigger kernels, which increases GPU utilization and results in better training speed.
7. Analyze the memory
The profiler records every memory allocation/release event during profiling. For every specific operator, the plugin aggregates all these events inside its life span.
Switch to the memory view and select “GPU0” in the “Device” selection box.
For the run with batch size 1, the memory usage is as below.
For the run with batch size 32, the memory usage is greatly increased. That’s because PyTorch must allocate more memory for input data, output data, and especially activation data with the bigger batch size. Take care that memory usage does not exceed the memory capacity of the GPU.
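To get a feel for how memory grows with batch size, the sketch below estimates the size of the input batch alone. It assumes float32 inputs of shape 3x224x224, which is what the Resize(224) and ToTensor() transforms in this tutorial produce; the function name input_batch_bytes is chosen for this example. It does not account for activations, weights, gradients, or allocator overhead, which make up most of the usage shown in the memory view.

```python
def input_batch_bytes(batch_size, channels=3, height=224, width=224, bytes_per_element=4):
    """Bytes needed to hold one input batch (float32 by default)."""
    return batch_size * channels * height * width * bytes_per_element


for batch_size in (1, 32):
    size_bytes = input_batch_bytes(batch_size)
    print(f"batch size {batch_size}: {size_bytes} bytes ({size_bytes / 2**20:.2f} MiB)")
```

The input tensor grows linearly with batch size, from about 0.57 MiB at batch size 1 to about 18.4 MiB at batch size 32, and activation memory scales in the same way across every layer of the network.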
What’s Next for the PyTorch Profiler?
You just saw how PyTorch Profiler helped optimize a Resnet model with the CIFAR10 dataset. You can now try the Profiler by pip install torch-tb-profiler to optimize your PyTorch model.
Look out for an advanced version of this tutorial in the future. If you want tailored enterprise-grade support for this, check out PyTorch Enterprise on Azure. We are also thrilled to continue to bring state-of-the-art tools to PyTorch users to improve ML performance. We’d love to hear from you. Feel free to open an issue here.
For new and exciting features coming up with PyTorch Profiler, follow us @PyTorch on Twitter and check us out on pytorch.org.
About the authors:
Teng Gao is a Principal Software Engineer working on developing deep learning frameworks and PyTorch Profiler at Microsoft.