How to Choose the Best GPU Optimized VM Sizes for Your Project on Azure

A common problem that data scientists face when training and deploying their machine learning models is the choice of the right type and size of hardware.

Migrating machine learning tasks to the cloud has significantly simplified the data scientist’s job: they now just need to log in to the Azure portal or Azure ML studio and select, from a wide range of available resources with different sizes, capabilities, and costs, the one most suitable for their use case. On the other hand, this long list of options can intimidate even experienced users. At the same time, GPU (Graphics Processing Unit) acceleration is a new and rapidly evolving field, and there is no true one-size-fits-all guidance for this product area. The goal of this blog post is to provide some guidelines that could help with this non-trivial task.

GPUs are ideal for compute- and graphics-intensive workloads, suiting scenarios like high-end remote visualization, deep learning, and predictive analytics. The N-series is a family of Azure Virtual Machines with GPU capabilities: specialized virtual machines available with single, multiple, or fractional GPUs.

The main factors which influence the choice of compute resource type to use for a job are the following:

1. Performance: Certain frameworks and algorithms used to train a model may need less time to complete the training task but come at a higher cost. GPU performance varies per workload but a quick overview can be found on NVIDIA’s website.

2. Cost: Depending on requirements, a user may prefer a model that either is more cost-effective or performs better. If cost saving is a major requirement, reserved instances for virtual machines or low-priority virtual machines can be explored as solutions.

3. Location: Virtual machine availability may differ across Azure regions. Also, if data is required to remain within a certain region, this can restrict which VM sizes you can choose. Resource availability per region can be explored using the VMs selector.

4. GPU memory size: Deep learning models benefit from the right selection of GPU memory size. The choice is driven by the memory requirements of the model to be trained (e.g. the size of the dataset and the number of parameters).

5. Use-case scenario:

    • The NC-series (powered by NVIDIA Tesla K80) and NCv2-series (powered by NVIDIA Tesla P100), both of which will be retired by August 2023, and the newer NCv3-series (powered by NVIDIA Tesla V100) are used for machine-learning and high-performance computing workloads (reservoir modeling, DNA sequencing, protein analysis, Monte Carlo simulations). The NC series is also a popular choice for developers and students learning about, developing for, or experimenting with GPU acceleration.
    • The ND-series (powered by NVIDIA Tesla P40), which will be retired by August 2023, and the newer NDv2-series (powered by 8 NVIDIA Tesla V100 NVLINK-connected GPUs) and ND A100 v4-series (powered by 8 NVIDIA Ampere A100 Tensor Core GPUs) are designed for training and inferencing scenarios for deep learning.
    • The NV-series (powered by NVIDIA Tesla M60), which will be retired by August 2023, and the newer NVv3-series (powered by NVIDIA Tesla M60), NVv4-series (powered by AMD Radeon Instinct MI25) and NCasT4_v3-series (powered by NVIDIA Tesla T4) are used for graphics-intensive applications (e.g. streaming, gaming, encoding) and/or remote visualization workloads. In particular, the NCasT4_v3-series currently offers the most performant GPU SKUs in Azure for a game development workstation.
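To make the GPU memory factor above more concrete, here is a rough back-of-the-envelope sketch in Python. It assumes 32-bit values and an optimizer that keeps two extra values per parameter (as Adam does), and it ignores activations and framework overhead, so treat the result as a lower bound rather than a sizing rule. The function names and the GPU labels in the usage example are hypothetical and do not correspond to specific Azure SKUs.

```python
def estimate_training_memory_gib(num_params, bytes_per_value=4, optimizer_states=2):
    """Rough lower bound (GiB) for weights + gradients + optimizer state.

    Assumption: activations, data batches, and framework overhead are NOT
    included, so real usage will be higher.
    """
    # One copy for the weights, one for the gradients,
    # plus `optimizer_states` extra copies kept by the optimizer.
    copies = 1 + 1 + optimizer_states
    return num_params * bytes_per_value * copies / 1024**3


def smallest_fitting_gpu(required_gib, gpu_memory_gib):
    """Return the name of the smallest GPU that fits, or None if none does."""
    candidates = [(mem, name) for name, mem in gpu_memory_gib.items()
                  if mem >= required_gib]
    return min(candidates)[1] if candidates else None


# Hypothetical memory sizes in GiB; check the actual VM size documentation.
example_gpus = {"gpu_16gb": 16, "gpu_40gb": 40, "gpu_80gb": 80}

needed = estimate_training_memory_gib(1_000_000_000)  # a 1B-parameter model
print(f"{needed:.1f} GiB before activations")
print(smallest_fitting_gpu(needed * 1.2, example_gpus))  # 20% headroom (assumed)
```

The 20% headroom is an arbitrary illustration; in practice the activation memory depends heavily on batch size and model architecture, which is one reason to measure real usage on a first run.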

A good practice for finding the optimal compute configuration is to run the workload, monitor the results, and scale up as needed. A wide range of command-line tools is available to monitor how your GPU compute is performing. One of the most common is the NVIDIA System Management Interface (nvidia-smi), which can be run at a defined interval.
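As an illustration of the monitor-then-scale approach, the sketch below calls nvidia-smi from Python using its CSV query mode and flags GPUs whose memory is nearly full. It assumes nvidia-smi is on the PATH and that every queried field returns a number (some GPUs report values such as "[N/A]", which this sketch does not handle). The function names and the 90% threshold are illustrative choices, not recommendations from NVIDIA or Azure.

```python
import subprocess

QUERY_FIELDS = ["utilization.gpu", "memory.used", "memory.total"]


def parse_nvidia_smi_csv(text):
    """Parse `--format=csv,noheader,nounits` output, one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        util, used, total = (float(value) for value in line.split(","))
        rows.append({"gpu_util_pct": util,
                     "mem_used_mib": used,
                     "mem_total_mib": total})
    return rows


def query_gpus():
    """Run nvidia-smi once and return the parsed readings."""
    cmd = ["nvidia-smi",
           "--query-gpu=" + ",".join(QUERY_FIELDS),
           "--format=csv,noheader,nounits"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_nvidia_smi_csv(result.stdout)


def memory_nearly_full(rows, threshold=0.9):
    """True if any GPU uses more than `threshold` of its memory (assumed cutoff)."""
    return any(r["mem_used_mib"] / r["mem_total_mib"] > threshold for r in rows)


# On a GPU VM you might call:  memory_nearly_full(query_gpus())
```

To sample at a fixed interval directly from the shell instead, nvidia-smi also accepts a loop option, for example `nvidia-smi -l 5` to refresh every five seconds.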

TL;DR

If you aren’t looking for a deep dive into how to choose a GPU-optimized VM and just want a quick overview of the Azure offering, here is a summary of the VM/GPU types and sizes you can choose from for your workload.

NC-series

Preferred scenarios: machine-learning and high-performance computing workloads that require a low absolute cost per GPU-hour.

| VM type | GPU type | Deprecation date |
| --- | --- | --- |
| NC-series | NVIDIA Tesla K80 | 31/08/2023 |
| NCv2-series | NVIDIA Tesla P100 | 31/08/2023 |
| NCv3-series | NVIDIA Tesla V100 | |
| NC A100 v4-series | NVIDIA A100 PCIe | |
| NCasT4_v3-series | NVIDIA Tesla T4 | |

ND-series

Preferred scenarios: training and inferencing scenarios for deep learning.

| VM type | GPU type | Deprecation date |
| --- | --- | --- |
| ND-series | NVIDIA Tesla P40 | 31/08/2023 |
| NDv2-series | NVIDIA Tesla V100 | |
| ND A100 v4-series | NVIDIA Ampere A100 | |

NV-series

Preferred scenarios: remote visualization workloads and other graphics-intensive applications.

| VM type | GPU type | Deprecation date |
| --- | --- | --- |
| NV-series | NVIDIA Tesla M60 | 31/08/2023 |
| NVv3-series | NVIDIA Tesla M60 | |
| NVv4-series | AMD Radeon Instinct MI25 | |
| NVadsA10 v5-series | NVIDIA A10 | |
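If you want to use this summary programmatically, for example to filter out series with an announced deprecation date, the tables above can be captured in a small Python structure. This is only a sketch: the data mirrors the tables in this post at the time of writing, and the names `GPU_VM_SERIES` and `series_without_deprecation` are hypothetical helpers, not part of any Azure SDK.

```python
# (VM series, GPU type, deprecation date or None), copied from the tables above.
GPU_VM_SERIES = {
    "NC": [
        ("NC-series", "NVIDIA Tesla K80", "31/08/2023"),
        ("NCv2-series", "NVIDIA Tesla P100", "31/08/2023"),
        ("NCv3-series", "NVIDIA Tesla V100", None),
        ("NC A100 v4-series", "NVIDIA A100 PCIe", None),
        ("NCasT4_v3-series", "NVIDIA Tesla T4", None),
    ],
    "ND": [
        ("ND-series", "NVIDIA Tesla P40", "31/08/2023"),
        ("NDv2-series", "NVIDIA Tesla V100", None),
        ("ND A100 v4-series", "NVIDIA Ampere A100", None),
    ],
    "NV": [
        ("NV-series", "NVIDIA Tesla M60", "31/08/2023"),
        ("NVv3-series", "NVIDIA Tesla M60", None),
        ("NVv4-series", "AMD Radeon Instinct MI25", None),
        ("NVadsA10 v5-series", "NVIDIA A10", None),
    ],
}


def series_without_deprecation(family):
    """List the series in a family with no deprecation date in the tables."""
    return [name for name, _gpu, deprecated in GPU_VM_SERIES[family]
            if deprecated is None]


print(series_without_deprecation("ND"))
```

Because retirement dates change over time, a lookup like this should be refreshed from the current Azure documentation rather than treated as authoritative.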

Article originally published here by Carlotta Castelluccio. Reposted with permission.

Carlotta Castelluccio

Carlotta Castelluccio is a Cloud Advocate at Microsoft, focused on AI and ML technologies and passionate about their use in education. As a member of the Developer Relationships Academic team, she works on skilling and engaging educational communities to create and grow with Azure Cloud, by contributing to technical learning content and supporting students and educators in their learning journey with Microsoft technologies. Before joining the Cloud Advocacy team, she worked as an Azure and AI consultant in the Microsoft Industry Solutions team, mainly on customer-facing engagements focused on Conversational AI solutions.
