

Training with PyTorch on Amazon SageMaker
Posted by Julien Simon on July 19, 2018

PyTorch is a flexible open source framework for Deep Learning experimentation. In this post, you will learn how to train PyTorch jobs on Amazon SageMaker. I’ll show you how to:
- build a custom Docker container for CPU and GPU training,
- pass parameters to a PyTorch script,
- save the trained model.
As usual, you’ll find my code on Github 🙂
Uruks train PyTorch on SageMaker. That’s a fact.
Building a custom container
SageMaker provides a collection of built-in algorithms as well as environments for TensorFlow and MXNet… but not for PyTorch. Fortunately, developers have the option to build custom containers for training and prediction.
Obviously, a number of conventions need to be defined for SageMaker to successfully invoke a custom container:
- Name of the training and prediction scripts: by default, they should respectively be set to ‘train’ and ‘serve’, be executable and have no extension. SageMaker will start training by running ‘docker run your_container train’.
- Location of hyperparameters in the container: /opt/ml/input/config/hyperparameters.json.
- Location of input data in the container: /opt/ml/input/data/<channel_name> (the input data configuration itself lands in /opt/ml/input/config/inputdataconfig.json). All of these paths are summarized just below.
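Put together, these conventions boil down to a handful of fixed paths that the training script can rely on. Here's a quick sketch (the variable names are just for illustration):

# Well-known paths inside a SageMaker training container (simplified sketch).
hyperparameters_file = '/opt/ml/input/config/hyperparameters.json'    # hyperparameters, serialized as strings
input_data_config_file = '/opt/ml/input/config/inputdataconfig.json'  # input data configuration
input_data_dir = '/opt/ml/input/data/'                                # one sub-directory per input channel
model_dir = '/opt/ml/model/'                                          # where the script must save its model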
This will require some changes in our PyTorch script, the well-known example of learning MNIST with a simple CNN. As you will see in a moment, they are quite minor and you won’t have any trouble adding them to your own code.
Building a Docker container
Here’s the Dockerfile.
FROM nvidia/cuda:9.0-runtime

RUN apt-get update && \
    apt-get -y install build-essential python-dev python3-dev python3-pip python-imaging wget curl

COPY mnist_cnn.py /opt/program/train
RUN chmod +x /opt/program/train

RUN pip3 install http://download.pytorch.org/whl/cu90/torch-0.4.0-cp35-cp35m-linux_x86_64.whl --upgrade && \
    pip3 install torchvision --upgrade

RUN rm -rf /var/lib/apt/lists/*
RUN rm -rf /root/.cache

ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/usr/local/lib"
ENV PATH="/opt/program:${PATH}"

WORKDIR /opt/program
Unlike MXNet, PyTorch comes in a single package that supports both CPU and GPU training.
Once the packages are installed, we clean up various caches to shrink the container a bit. We also copy the PyTorch script to /opt/program under its expected name ('train') and make it executable.
For more flexibility, we could write a generic launcher that would fetch the actual training script from an S3 location passed as a hyperparameter. This is left as an exercise for the reader 😉
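If you want a head start, here's a minimal sketch of what such a launcher could look like. It assumes a hypothetical 'training_script' hyperparameter holding an S3 URI; boto3 downloads the script, which is then executed:

import json
import subprocess
from urllib.parse import urlparse

import boto3

# Read the hypothetical 'training_script' hyperparameter, e.g. 's3://my-bucket/scripts/mnist_cnn.py'.
with open('/opt/ml/input/config/hyperparameters.json') as f:
    training_script_uri = json.load(f)['training_script']

# Download the actual training script from S3 and run it.
parsed = urlparse(training_script_uri)
boto3.client('s3').download_file(parsed.netloc, parsed.path.lstrip('/'), '/opt/program/user_script.py')
subprocess.check_call(['python3', '/opt/program/user_script.py'])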
Finally, we set the directory of our script as the work directory and add it to the path.
It’s not a long file, but as usual with these things, every detail counts.
Creating a Docker repository in Amazon ECR
SageMaker requires that the containers it fetches are hosted in Amazon ECR. Let’s create a repo and login to it.
aws ecr describe-repositories --repository-names $repo_name > /dev/null 2>&1
if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name $repo_name > /dev/null
fi

$(aws ecr get-login --region $region --no-include-email)
Building and pushing our containers to ECR
OK, now it’s time to build both containers and push them to their repos. We’ll do this separately for the CPU and GPU versions. Strictly Docker stuff. Please refer to the notebook for details on variables.
docker build -t $image_tag -f $dockerfile .
docker tag $image_tag $account.dkr.ecr.$region.amazonaws.com/$repo_name:latest
docker push $account.dkr.ecr.$region.amazonaws.com/$repo_name:latest

The Docker part is over. Now let’s configure our training job in SageMaker.
Configuring the training job
This is actually quite underwhelming, which is great news: nothing really differs from training with a built-in algorithm!
First we need to upload the MNIST data set from our local machine to S3. We’ve done this many times before, nothing new here.
local_directory = 'data'
prefix = repo_name + '/input'

train_input_path = sess.upload_data(
    local_directory + '/train/', key_prefix=prefix + '/train')
validation_input_path = sess.upload_data(
    local_directory + '/validation/', key_prefix=prefix + '/validation')
Then, configuring the training job boils down to:
- selecting one of the containers we just built and setting the usual parameters for SageMaker estimators,
- passing hyperparameters to the PyTorch script,
- passing input data to the PyTorch script.
Unlike Keras, PyTorch has APIs to check if CUDA is available and to detect how many GPUs are available. Thus, we don’t need to pass this information in a hyperparameter. Multi-GPU training is also possible but requires extra work: MXNet makes it much simpler.
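For the record, this is the kind of check the script performs (a minimal sketch, the exact code in mnist_cnn.py may differ):

import torch

use_cuda = torch.cuda.is_available()   # True if at least one GPU is visible
num_gpus = torch.cuda.device_count()   # number of available GPUs
device = torch.device('cuda' if use_cuda else 'cpu')
print('Training on {} ({} GPU(s) detected)'.format(device, num_gpus))

With that out of the way, here's the full job configuration: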
output_path = 's3://{}/{}/output'.format(sess.default_bucket(), repo_name)
image_name = '{}.dkr.ecr.{}.amazonaws.com/{}:latest'.format(account, region, repo_name)
print(output_path)
print(image_name)

estimator = sagemaker.estimator.Estimator(
    image_name=image_name,
    base_job_name=base_job_name,
    role=role,
    train_instance_count=1,
    train_instance_type=train_instance_type,
    output_path=output_path,
    sagemaker_session=sess)

estimator.set_hyperparameters(lr=0.01, epochs=10, batch_size=batch_size)

estimator.fit({'training': train_input_path, 'validation': validation_input_path})
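One detail worth noting: the numeric values passed to set_hyperparameters() reach the container serialized as strings. For this job, hyperparameters.json would contain roughly the following (you can see it echoed in the training log further down), which is why the script converts values back to float and int:

# Approximate contents of /opt/ml/input/config/hyperparameters.json for this job:
# every value arrives as a string.
hyperparameters = {'lr': '0.01', 'epochs': '10', 'batch_size': '128'}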
Adapting the PyTorch script for SageMaker
We need to take care of hyperparameters, input data, multi-GPU configuration, loading the data set and saving models.
Passing hyperparameters and input data configuration
As mentioned earlier, SageMaker copies hyperparameters to /opt/ml/input/config/hyperparameters.json. All we have to do is read this file, extract parameters and set default values if needed.
import json
import os

# SageMaker paths
prefix = '/opt/ml/'
param_path = os.path.join(prefix, 'input/config/hyperparameters.json')
data_path = os.path.join(prefix, 'input/config/inputdataconfig.json')

# Read hyperparameters passed by SageMaker
with open(param_path, 'r') as params:
    hyperParams = json.load(params)
lr = float(hyperParams.get('lr', '0.1'))
batch_size = int(hyperParams.get('batch_size', '128'))
epochs = int(hyperParams.get('epochs', '10'))

# Read input data config passed by SageMaker
with open(data_path, 'r') as params:
    inputParams = json.load(params)
In a similar fashion, SageMaker copies the input data configuration to /opt/ml/input/config/inputdataconfig.json. We’ll handle it in exactly the same way.
In this example, I don’t need this configuration info, but this is how you’d read it if you did 🙂
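For reference, here's roughly what that file contains for this job, matching what's printed in the training log further down:

# Approximate contents of /opt/ml/input/config/inputdataconfig.json for this job:
# one entry per input channel, describing how the data was delivered.
input_data_config = {
    'training':   {'TrainingInputMode': 'File', 'S3DistributionType': 'FullyReplicated', 'RecordWrapperType': 'None'},
    'validation': {'TrainingInputMode': 'File', 'S3DistributionType': 'FullyReplicated', 'RecordWrapperType': 'None'},
}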
Loading the training and validation set
When training in file mode (which is the case here), SageMaker automatically copies the data set to /opt/ml/input/data/<channel_name>: here, we defined the training and validation channels, so we’ll have to:
- read the MNIST files from the corresponding directories,
- build Dataset objects for the training and validation sets,
- load them using the DataLoader object.
import os
import torch
import torch.utils.data as data
from PIL import Image
from torchvision import transforms

# SageMaker paths
prefix = '/opt/ml/'
input_path = os.path.join(prefix, 'input/data/')

# Adapted from https://github.com/pytorch/vision/blob/master/torchvision/datasets/mnist.py
class MyMNIST(data.Dataset):
    def __init__(self, train=True, transform=None, target_transform=None):
        self.transform = transform
        self.target_transform = target_transform
        self.train = train  # training set or test set
        # Loading local MNIST files in PyTorch format: training.pt and test.pt.
        if self.train:
            self.train_data, self.train_labels = \
                torch.load(os.path.join(input_path, 'training/training.pt'))
        else:
            self.test_data, self.test_labels = \
                torch.load(os.path.join(input_path, 'validation/test.pt'))

    def __getitem__(self, index):
        if self.train:
            img, target = self.train_data[index], self.train_labels[index]
        else:
            img, target = self.test_data[index], self.test_labels[index]
        # doing this so that it is consistent with all other datasets
        # to return a PIL Image
        img = Image.fromarray(img.numpy(), mode='L')
        if self.transform is not None:
            img = self.transform(img)
        if self.target_transform is not None:
            target = self.target_transform(target)
        return img, target

    def __len__(self):
        if self.train:
            return len(self.train_data)
        else:
            return len(self.test_data)

...

# batch_size, test_batch_size and kwargs are defined elsewhere in the script.
train_loader = torch.utils.data.DataLoader(
    MyMNIST(train=True,
            transform=transforms.Compose([
                transforms.ToTensor(),
                transforms.Normalize((0.1307,), (0.3081,))
            ])),
    batch_size=batch_size, shuffle=True, **kwargs)

test_loader = torch.utils.data.DataLoader(
    MyMNIST(train=False,
            transform=transforms.Compose([
                transforms.ToTensor(),
                transforms.Normalize((0.1307,), (0.3081,))
            ])),
    batch_size=test_batch_size, shuffle=True, **kwargs)
Saving the model
The very last thing we need to do once training is complete is to save the model in /opt/ml/model: SageMaker will grab all artefacts present in this directory, build a file called model.tar.gz and copy it to the S3 bucket used by the training job.
# Save the model where SageMaker expects to find it.
model_path = os.path.join(prefix, 'model/')   # i.e. /opt/ml/model/
torch.save(model, model_path + 'mnist-cnn-' + str(epochs) + '.pt')
That’s it. As you can see, it’s all about interfacing your script with SageMaker input and output. The bulk of your PyTorch code doesn’t require any modification.
Running the script
Alright, let’s run this on a p3.2xlarge instance.
Hyper parameters: {'epochs': '10', 'lr': '0.01', 'batch_size': '128'}
Input parameters: {'training': {'RecordWrapperType': 'None', 'TrainingInputMode': 'File', 'S3DistributionType': 'FullyReplicated'}, 'validation': {'RecordWrapperType': 'None', 'TrainingInputMode': 'File', 'S3DistributionType': 'FullyReplicated'}}
Train Epoch: 1 [0/60000 (0%)]       Loss: 2.349514
Train Epoch: 1 [1280/60000 (2%)]    Loss: 2.296775
Train Epoch: 1 [2560/60000 (4%)]    Loss: 2.258955
Train Epoch: 1 [3840/60000 (6%)]    Loss: 2.243712
Train Epoch: 1 [5120/60000 (9%)]    Loss: 2.108034
Train Epoch: 1 [6400/60000 (11%)]   Loss: 1.979539
...
Train Epoch: 10 [56320/60000 (94%)] Loss: 0.178176
Train Epoch: 10 [57600/60000 (96%)] Loss: 0.109542
Train Epoch: 10 [58880/60000 (98%)] Loss: 0.151139
Test set: Average loss: 0.0450, Accuracy: 9864/10000 (99%)
===== Job Complete =====
Billable seconds: 174
$ aws s3 ls $BUCKET/pytorch/output/pytorch-mnist-cnn-2018-06-02-08-16-11-355/output/
2018-06-02 08:20:28      86507 model.tar.gz
$ aws s3 cp $BUCKET/pytorch/output/pytorch-mnist-cnn-2018-06-02-08-16-11-355/output/model.tar.gz .
$ tar tvfz model.tar.gz
-rw-r--r-- 0/0           99436 2018-06-02 08:20 mnist-cnn-10.pt
Pretty cool, right? We can now use this model anywhere we like.
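For instance, once model.tar.gz has been downloaded and extracted, reloading the model takes a couple of lines (a sketch, assuming PyTorch and the CNN class definition are importable locally, since torch.save(model, ...) pickles the full module):

import torch

# Load the full model saved with torch.save(model, ...).
model = torch.load('mnist-cnn-10.pt', map_location='cpu')
model.eval()  # switch to inference mode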
That’s it for today. Another (hopefully) nice example of using SageMaker to train your custom jobs on fully-managed infrastructure!
Happy to answer questions here or on Twitter. For more content, please feel free to check out my YouTube channel.