Training with PyTorch on Amazon SageMaker

PyTorch is a flexible open source framework for Deep Learning experimentation. In this post, you will learn how to train PyTorch jobs on Amazon SageMaker. I’ll show you how to:

  • build a custom Docker container for CPU and GPU training,
  • pass parameters to a PyTorch script,
  • save the trained model.

As usual, you’ll find my code on GitHub 🙂

Uruks train PyTorch on SageMaker. That’s a fact.

Building a custom container

SageMaker provides a collection of built-in algorithms as well as environments for TensorFlow and MXNet… but not for PyTorch. Fortunately, developers have the option to build custom containers for training and prediction.

Obviously, a number of conventions need to be defined for SageMaker to successfully invoke a custom container:

  • Name of the training and prediction scripts: by default, they should respectively be set to ‘train’ and ‘serve’, be executable and have no extension. SageMaker will start training by running ‘docker run your_container train’.
  • Location of hyper parameters in the container: /opt/ml/input/config/hyperparameters.json.
  • Location of the input data configuration in the container: /opt/ml/input/config/inputdataconfig.json. The data itself is copied under /opt/ml/input/data.
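As a quick reference, these locations can be collected in one place. This is only a sketch: the `container_paths` helper and its key names are made up for illustration, and the model directory is the one used later in this post.

```python
import os


def container_paths(prefix="/opt/ml"):
    """Return the SageMaker container locations discussed in this post.

    The function and key names are illustrative, not part of any SDK.
    """
    return {
        # JSON file holding the hyper parameters of the training job
        "hyperparameters": os.path.join(prefix, "input/config/hyperparameters.json"),
        # JSON file describing the input channels
        "input_data_config": os.path.join(prefix, "input/config/inputdataconfig.json"),
        # One sub-directory per channel when training in file mode
        "data": os.path.join(prefix, "input/data"),
        # Everything saved here is packaged into model.tar.gz
        "model": os.path.join(prefix, "model"),
    }


paths = container_paths()
print(paths["hyperparameters"])
```

Passing a different prefix makes it easy to run the same script against a local copy of this layout.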

This will require some changes in our PyTorch script, the well-known example of learning MNIST with a simple CNN. As you will see in a moment, they are quite minor and you won’t have any trouble adding them to your own code.

Building a Docker container

Here’s the Dockerfile.

FROM nvidia/cuda:9.0-runtime
RUN apt-get update && \
    apt-get -y install build-essential python-dev python3-dev python3-pip python-imaging wget curl
COPY mnist_cnn.py /opt/program/train
RUN chmod +x /opt/program/train
RUN pip3 install http://download.pytorch.org/whl/cu90/torch-0.4.0-cp35-cp35m-linux_x86_64.whl --upgrade && \
    pip3 install torchvision --upgrade
RUN rm -rf /var/lib/apt/lists/*
RUN rm -rf /root/.cache
ENV PATH="/opt/program:${PATH}"
WORKDIR /opt/program

We start from the CUDA 9.0 image, which is based on Ubuntu 16.04. It has all the CUDA libraries that PyTorch needs. We then add Python 3 and the PyTorch packages.

Unlike MXNet, PyTorch comes in a single package that supports both CPU and GPU training.

Once this is done, we clean up various caches to shrink the container size a bit. Then, we copy the PyTorch script to /opt/program with the proper name (‘train’) and we make it executable.

For more flexibility, we could write a generic launcher that would fetch the actual training script from an S3 location passed as a hyper parameter. This is left as an exercise for the reader 😉
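If you’d like a starting point for that exercise, here’s a minimal sketch of such a launcher. It is not the code from this post. It assumes that boto3 is installed in the container, that the training job’s role can read the script’s bucket, and that the script location is passed in a hyper parameter which I’ve arbitrarily called `script_s3_uri`; the target file name is made up too.

```python
#!/usr/bin/env python3
import json
import subprocess
import sys
from urllib.parse import urlparse

PARAM_PATH = "/opt/ml/input/config/hyperparameters.json"


def parse_s3_uri(uri):
    """Split 's3://bucket/some/key.py' into ('bucket', 'some/key.py')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc or not parsed.path.strip("/"):
        raise ValueError("not a valid S3 object URI: {}".format(uri))
    return parsed.netloc, parsed.path.lstrip("/")


def launch(param_path=PARAM_PATH, target="/opt/program/user_script.py"):
    """Download the script named in the hyper parameters and run it."""
    import boto3  # assumption: boto3 is installed in the container

    with open(param_path, "r") as params:
        hyper_params = json.load(params)
    # 'script_s3_uri' is a hypothetical hyper parameter name
    bucket, key = parse_s3_uri(hyper_params["script_s3_uri"])
    boto3.client("s3").download_file(bucket, key, target)
    # Run the script with the same interpreter and return its exit code
    return subprocess.call([sys.executable, target])


# In the container, the 'train' executable would end with: sys.exit(launch())
print(parse_s3_uri("s3://my-bucket/scripts/mnist_cnn.py"))
```

Returning the exit code of the downloaded script matters: a non-zero code is how the launcher would tell SageMaker that training failed.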

Finally, we set the directory of our script as the work directory and add it to the path.

It’s not a long file, but as usual with these things, every detail counts.

Creating a Docker repository in Amazon ECR

SageMaker requires that the containers it fetches are hosted in Amazon ECR. Let’s create a repo and log in to it.

aws ecr describe-repositories --repository-names $repo_name > /dev/null 2>&1
if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name $repo_name > /dev/null
fi
# get-login is AWS CLI v1 syntax; AWS CLI v2 replaced it with get-login-password
$(aws ecr get-login --region $region --no-include-email)

Building and pushing our containers to ECR

OK, now it’s time to build both containers and push them to their repos. We’ll do this separately for the CPU and GPU versions. Strictly Docker stuff. Please refer to the notebook for details on variables.

docker build -t $image_tag -f $dockerfile .
docker tag $image_tag $account.dkr.ecr.$region.amazonaws.com/$repo_name:latest
docker push $account.dkr.ecr.$region.amazonaws.com/$repo_name:latest

Once we’re done, you should see your container in ECR.
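The same image name is needed again on the Python side when configuring the estimator, so it can help to build it in a single place. A tiny sketch; the helper name, account id, region and repository name below are placeholders, not values from my notebook.

```python
def ecr_image_uri(account, region, repo_name, tag="latest"):
    """Build the ECR image name used by 'docker tag' and 'docker push'."""
    return "{}.dkr.ecr.{}.amazonaws.com/{}:{}".format(account, region, repo_name, tag)


# Placeholder values, for illustration only
print(ecr_image_uri("123456789012", "eu-west-1", "pytorch-mnist-cnn"))
# prints 123456789012.dkr.ecr.eu-west-1.amazonaws.com/pytorch-mnist-cnn:latest
```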

The Docker part is over. Now let’s configure our training job in SageMaker.

Configuring the training job

This is actually quite underwhelming, which is great news: nothing really differs from training with a built-in algorithm!

First we need to upload the MNIST data set from our local machine to S3. We’ve done this many times before, nothing new here.

local_directory = 'data'
prefix = repo_name + '/input'
train_input_path = sess.upload_data(
    local_directory + '/train/', key_prefix=prefix + '/train')
validation_input_path = sess.upload_data(
    local_directory + '/validation/', key_prefix=prefix + '/validation')

Then, we configure the training job by:
  • selecting one of the containers we just built and setting the usual parameters for SageMaker estimators,
  • passing hyper parameters to the PyTorch script,
  • passing input data to the PyTorch script.

Unlike Keras, PyTorch has APIs to check if CUDA is available and to detect how many GPUs are available. Thus, we don’t need to pass this information in a hyper parameter. Multi-GPU training is also possible but requires extra work: MXNet makes it much simpler.
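Here’s roughly what that detection can look like. This is a sketch rather than the exact code of my script: the two helper names are made up, and wrapping the model in `torch.nn.DataParallel` is just one possible way to do the "extra work" for multi-GPU training.

```python
def training_setup(cuda_available, gpu_count):
    """Return (device_name, use_data_parallel) from what the framework reports."""
    if cuda_available and gpu_count > 0:
        return "cuda", gpu_count > 1
    return "cpu", False


def detect_with_torch():
    """Query PyTorch for CUDA support and for the number of GPUs."""
    import torch  # imported here so the helper above stays framework-free

    return training_setup(torch.cuda.is_available(), torch.cuda.device_count())


# Typical use inside the training script (model being any torch.nn.Module):
#   device_name, multi_gpu = detect_with_torch()
#   model = model.to(torch.device(device_name))
#   if multi_gpu:
#       model = torch.nn.DataParallel(model)
print(training_setup(False, 0))
```

Keeping the decision in a plain function makes it easy to check the CPU, single-GPU and multi-GPU cases without needing the hardware.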

output_path = 's3://{}/{}/output'.format(sess.default_bucket(), repo_name)
image_name = '{}.dkr.ecr.{}.amazonaws.com/{}:latest'.format(account, region, repo_name)
estimator = sagemaker.estimator.Estimator(
    ...)  # arguments truncated in this listing; see the notebook for the full call
estimator.set_hyperparameters(lr=0.01, epochs=10, batch_size=batch_size)
estimator.fit({'training': train_input_path, 'validation': validation_input_path})

That’s it for training. The last part we’re missing is adapting our PyTorch script for SageMaker. Let’s get to it.

Adapting the PyTorch script for SageMaker

We need to take care of hyper parameters, input data, multi-GPU configuration, loading the data set and saving models.

Passing hyper parameters and input data configuration

As mentioned earlier, SageMaker copies hyper parameters to /opt/ml/input/config/hyperparameters.json. All we have to do is read this file, extract parameters and set default values if needed.

import json
import os

# SageMaker paths
prefix = '/opt/ml/'
param_path = os.path.join(prefix, 'input/config/hyperparameters.json')
data_path = os.path.join(prefix, 'input/config/inputdataconfig.json')
# Read hyper parameters passed by SageMaker
with open(param_path, 'r') as params:
    hyperParams = json.load(params)
lr = float(hyperParams.get('lr', 0.1))
batch_size = int(hyperParams.get('batch_size', 128))
epochs = int(hyperParams.get('epochs', 10))
# Read input data config passed by SageMaker
with open(data_path, 'r') as params:
    inputParams = json.load(params)

In a similar fashion, SageMaker copies the input data configuration to /opt/ml/input/config/inputdataconfig.json. We’ll handle things in exactly the same way.

In this example, I don’t need this configuration info, but this is how you’d read it if you did 🙂
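One detail worth keeping in mind: SageMaker passes every hyper parameter value as a string (you can see it in the training log further down), which is why each value is cast explicitly. Here’s a small sketch of a helper that does the cast from the type of the default value and also tolerates a missing file, which is handy when running the script outside SageMaker. The helper name is mine, and the temporary file only simulates what SageMaker would write.

```python
import json
import os
import tempfile


def read_hyperparameters(path, defaults):
    """Read hyperparameters.json and cast each value to the type of its default."""
    raw = {}
    if os.path.exists(path):  # lets the script also run outside SageMaker
        with open(path, "r") as params:
            raw = json.load(params)
    # Values arrive as strings, so cast them using the type of the default
    return {name: type(default)(raw.get(name, default))
            for name, default in defaults.items()}


# Simulate the file that SageMaker would write in the container
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "hyperparameters.json")
    with open(path, "w") as f:
        json.dump({"lr": "0.01", "epochs": "10"}, f)
    hp = read_hyperparameters(path, {"lr": 0.1, "batch_size": 128, "epochs": 10})

print(hp)
```

This works for numbers and strings. Booleans would need special handling, since `bool("False")` is `True` in Python.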

Loading the training and validation set

When training in file mode (which is the case here), SageMaker automatically copies the data set to /opt/ml/input/data/<channel_name>: here, we defined the training and validation channels, so we’ll have to:

  • read the MNIST files from the corresponding directories,
  • build Dataset objects for the training and validation set,
  • load them using the DataLoader object.

import os
import torch
import torch.utils.data as data
from PIL import Image
from torchvision import transforms

# SageMaker paths
prefix = '/opt/ml/'
input_path = os.path.join(prefix, 'input/data/')

# Adapted from https://github.com/pytorch/vision/blob/master/torchvision/datasets/mnist.py
class MyMNIST(data.Dataset):
    def __init__(self, train=True, transform=None, target_transform=None):
        self.transform = transform
        self.target_transform = target_transform
        self.train = train  # training set or test set
        # Loading local MNIST files in PyTorch format: training.pt and test.pt.
        if self.train:
            self.train_data, self.train_labels = ...  # truncated in this listing: loads training.pt
        else:
            self.test_data, self.test_labels = ...  # truncated in this listing: loads test.pt

    def __getitem__(self, index):
        if self.train:
            img, target = self.train_data[index], self.train_labels[index]
        else:
            img, target = self.test_data[index], self.test_labels[index]
        # doing this so that it is consistent with all other datasets
        # to return a PIL Image
        img = Image.fromarray(img.numpy(), mode='L')
        if self.transform is not None:
            img = self.transform(img)
        if self.target_transform is not None:
            target = self.target_transform(target)
        return img, target

    def __len__(self):
        if self.train:
            return len(self.train_data)
        else:
            return len(self.test_data)

# kwargs and test_batch_size are defined elsewhere in the script
train_loader = torch.utils.data.DataLoader(
    MyMNIST(train=True, transform=transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])),
    batch_size=batch_size, shuffle=True, **kwargs)
test_loader = torch.utils.data.DataLoader(
    MyMNIST(train=False, transform=transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])),
    batch_size=test_batch_size, shuffle=True, **kwargs)
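If you’d rather not hard-code channel names, the input data configuration can be used to find the data directories. The sketch below assumes file mode and the directory layout described in this post; the helper name is made up, and the temporary directory only simulates the container so that the snippet runs anywhere.

```python
import json
import os
import tempfile


def channel_directories(prefix="/opt/ml"):
    """Map each channel in inputdataconfig.json to its data directory."""
    config_path = os.path.join(prefix, "input/config/inputdataconfig.json")
    with open(config_path, "r") as config:
        channels = json.load(config)
    # In file mode, the data of channel <name> is under input/data/<name>
    return {name: os.path.join(prefix, "input/data", name) for name in channels}


# Simulate the container layout in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "input/config"))
    with open(os.path.join(tmp, "input/config/inputdataconfig.json"), "w") as f:
        json.dump({"training": {"TrainingInputMode": "File"},
                   "validation": {"TrainingInputMode": "File"}}, f)
    dirs = channel_directories(tmp)

print(sorted(dirs))
```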

Saving the model

The very last thing we need to do once training is complete is to save the model in /opt/ml/model: SageMaker will grab all artefacts present in this directory, build a file called model.tar.gz and copy it to the S3 bucket used by the training job.

torch.save(model, model_path + 'mnist-cnn-' + str(epochs) + '.pt')

That’s it. As you can see, it’s all about interfacing your script with SageMaker input and output. The bulk of your PyTorch code doesn’t require any modification.

Running the script

Alright, let’s run this on a p3.2xlarge instance.

Hyper parameters: {'epochs': '10', 'lr': '0.01', 'batch_size': '128'}
Input parameters: {'training': {'RecordWrapperType': 'None', 'TrainingInputMode': 'File', 'S3DistributionType': 'FullyReplicated'}, 'validation': {'RecordWrapperType': 'None', 'TrainingInputMode': 'File', 'S3DistributionType': 'FullyReplicated'}}
Train Epoch: 1 [0/60000 (0%)] Loss: 2.349514
Train Epoch: 1 [1280/60000 (2%)] Loss: 2.296775
Train Epoch: 1 [2560/60000 (4%)] Loss: 2.258955
Train Epoch: 1 [3840/60000 (6%)] Loss: 2.243712
Train Epoch: 1 [5120/60000 (9%)] Loss: 2.108034
Train Epoch: 1 [6400/60000 (11%)] Loss: 1.979539
Train Epoch: 10 [56320/60000 (94%)] Loss: 0.178176
Train Epoch: 10 [57600/60000 (96%)] Loss: 0.109542
Train Epoch: 10 [58880/60000 (98%)] Loss: 0.151139
Test set: Average loss: 0.0450, Accuracy: 9864/10000 (99%)
===== Job Complete =====
Billable seconds: 174

Let’s check the S3 bucket.

$ aws s3 ls $BUCKET/pytorch/output/pytorch-mnist-cnn-2018-06-02-08-16-11-355/output/
2018-06-02 08:20:28      86507 model.tar.gz
$ aws s3 cp $BUCKET/pytorch/output/pytorch-mnist-cnn-2018-06-02-08-16-11-355/output/model.tar.gz .
$ tar tvfz model.tar.gz
-rw-r--r-- 0/0   99436 2018-06-02 08:20 mnist-cnn-10.pt

Pretty cool, right? We can now use this model anywhere we like.
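For instance, here’s a sketch of unpacking the archive from Python. The helper name is mine, and the demonstration builds a stand-in archive with a placeholder file so that the snippet runs without the real model; with the real model.tar.gz you would only call `extract_model`.

```python
import os
import tarfile
import tempfile


def extract_model(archive_path, target_dir):
    """Unpack model.tar.gz and return the paths of the extracted files."""
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(target_dir)
    return [os.path.join(target_dir, name) for name in names]


# Demonstration with a stand-in archive; a real one comes from the S3 bucket
with tempfile.TemporaryDirectory() as tmp:
    dummy = os.path.join(tmp, "mnist-cnn-10.pt")
    with open(dummy, "wb") as f:
        f.write(b"placeholder")
    archive_path = os.path.join(tmp, "model.tar.gz")
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(dummy, arcname="mnist-cnn-10.pt")
    out_dir = os.path.join(tmp, "out")
    os.makedirs(out_dir)
    extracted = [os.path.basename(p) for p in extract_model(archive_path, out_dir)]

print(extracted)
```

The extracted file can then be loaded with `torch.load`. Since the script saved the whole model object rather than its state dictionary, the class defining the network has to be importable wherever you load it. Only unpack and load archives that you trust.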

That’s it for today. Another (hopefully) nice example of using SageMaker to train your custom jobs on fully-managed infrastructure!

Happy to answer questions here or on Twitter. For more content, please feel free to check out my YouTube channel.



Julien Simon

Julien focuses on helping developers and organizations bring their ideas to life. He frequently speaks at conferences and he also blogs at https://medium.com/@julsimon. Prior to joining AWS, Julien served for 10 years as CTO/VP Engineering in top-tier web startups where he led large Software and Ops teams in charge of thousands of servers worldwide. In the process, he fought his way through a wide range of technical, business and procurement issues, which helped him gain a deep understanding of physical infrastructure, its limitations and how cloud computing can help. Last but not least, Julien holds all eight AWS certifications.