Mastering the Mystical Art of Model Deployment

With all the talk about algorithm selection, hyper-parameter optimization and so on, you could think that training models is the hardest part of the Machine Learning process. However, in my experience, the really tricky step is to deploy these models safely in a web production environment.

In this post, I’ll first talk about the typical tasks required to deploy and validate models in production. Then, I’ll present several model deployment techniques and how to implement them with Amazon SageMaker. In particular, I’ll show you in detail how to host multiple models on the same prediction endpoint, an important technique to minimize deployment risks.

                                        That guy played Alan Turing and Dr Strange. ‘nuff said.

Validating a model

Even if you’ve carefully trained and evaluated a model in your Data Science sandbox, additional work is required to check that it will work correctly in your production environment. This usually involves tasks like:

  • setting up a monitoring system to store and visualize model metrics,
  • building a test web application bundling the model and running technical tests (is the model fast? how much RAM does it require? etc.) as well as prediction tests (is my model still predicting as expected?),
  • integrating the model with your business application and running end to end tests,
  • deploying the application in production using techniques like blue-green deployment or canary testing (more on this in a minute),
  • running different versions of the same model in parallel for longer periods of time, in order to measure their long-term effectiveness with respect to business metrics (aka A/B testing).
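To give a rough idea of what the technical and prediction tests could look like, here is a minimal sketch. The `predict` callable, the sample format and the latency threshold are all assumptions made for this example: in practice, `predict` would wrap a call to your test web application.

```python
import time


def run_checks(predict, samples, max_latency_s=0.5):
    """Run each (payload, expected) pair through `predict`.

    Returns the worst latency observed and the payloads whose
    prediction did not match the expected value.
    """
    worst_latency = 0.0
    mismatches = []
    for payload, expected in samples:
        start = time.perf_counter()
        result = predict(payload)
        worst_latency = max(worst_latency, time.perf_counter() - start)
        if result != expected:
            mismatches.append(payload)
    return {
        "worst_latency_s": worst_latency,
        "too_slow": worst_latency > max_latency_s,
        "mismatches": mismatches,
    }
```

Running the same set of samples against every new model version is a cheap way to spot regressions before going any further.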

Quite a bit of work, then. Let’s first look at the different ways we could deploy models.

Deployment options

Standard deployment

In its simplest form, deploying a model usually involves building a bespoke web application hosting your model and receiving prediction requests. Testing is what you would expect: sending HTTP requests, checking logs and checking metrics.

SageMaker greatly simplifies this process. With just a few lines of code, the Estimator object in the SageMaker SDK (or its subclasses for built-in algos, TensorFlow, etc.) lets you deploy a model to an HTTPS endpoint and run prediction tests. No need to write any app. In addition, technical metrics are available out of the box in CloudWatch.

If you’re interested in load testing and sizing endpoints, this nice AWS blog post will show you how to do it.

I won’t dwell on this: I’ve covered it several times in previous posts and you’ll also find plenty of examples in the SageMaker notebook collection.

Blue-green deployment

This proven deployment technique requires two identical environments:

  • the live production environment (“blue”) running version n,
  • an exact copy of this environment (“green”) running version n+1.

First, you run tests on the green environment, monitor technical and business metrics and check that everything is correct. If it is, you can then switch traffic to the green environment… and check again. If something goes wrong, you can immediately switch back to the blue environment and investigate. If everything is fine, you can delete the blue environment.

To make this process completely transparent to client applications, a middleman — located between the clients and the environments — is in charge of implementing the switch: popular choices include load balancers, DNS, etc. This is what it looks like.

                                                                   Blue-green deployment

Blue-green deployment, the SageMaker way

The AWS SDK for SageMaker provides the middleman that we need in the form of the endpoint configuration. This resource lets us attach several models to the same endpoint, with different weights and different instance configurations (aka production variants). The setup may be updated at any time during the life of the endpoint.

In fact, one could picture an endpoint as a special type of load balancer, using weighted round robin to send prediction requests to instance pools hosting different models. Here’s the information required to set one up with the CreateEndpointConfig API.

"ProductionVariants": [
   {
      "InitialInstanceCount": number,
      "InitialVariantWeight": number,
      "InstanceType": "string",
      "ModelName": "string",
      "VariantName": "string"
   }
]

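To make the role of the weights concrete, here is a small sketch of how they translate into traffic shares: each production variant receives its weight divided by the sum of all weights. The variant names and weights below are made up for the example.

```python
def traffic_shares(production_variants):
    """Map each VariantName to its expected fraction of requests."""
    total = sum(v["InitialVariantWeight"] for v in production_variants)
    if total <= 0:
        raise ValueError("at least one variant needs a positive weight")
    return {
        v["VariantName"]: v["InitialVariantWeight"] / total
        for v in production_variants
    }


# Hypothetical variants: names and weights are illustrative only.
variants = [
    {"VariantName": "blue", "InitialVariantWeight": 3.0},
    {"VariantName": "green", "InitialVariantWeight": 1.0},
]
shares = traffic_shares(variants)  # blue: 0.75, green: 0.25
```

Weights are relative, so 3 and 1 behave exactly like 75 and 25.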
Implementing blue-green deployment now goes like this:

  • create a new endpoint configuration, holding the production variants for the existing live model and for the new model,
  • update the existing live endpoint with the new endpoint configuration (UpdateEndpoint API): SageMaker creates the required infrastructure for the new production variant and updates weights without any downtime,
  • switch traffic to the new model (UpdateEndpointWeightsAndCapacities API),
  • create a new endpoint configuration holding only the new production variant and apply it to the endpoint: SageMaker terminates the infrastructure for the previous production variant.

This is what it looks like.

                                      Blue-green deployment with a single SageMaker endpoint
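The four steps of the SageMaker procedure could be sketched as follows. This is only an outline under a few assumptions: `sm` is expected to behave like a boto3 SageMaker client (`boto3.client("sagemaker")`), `blue` and `green` are production variant dictionaries without a weight, the configuration names are placeholders, and `validate` is a hypothetical callback running whatever checks you need before the switch.

```python
def blue_green_switch(sm, endpoint_name, blue, green,
                      both_config, green_config, validate):
    """Return True if traffic was switched to green, False otherwise."""
    waiter = sm.get_waiter("endpoint_in_service")

    # 1. New configuration holding both variants. Assumption: green starts
    #    with weight 0, so it gets no live traffic until the switch.
    sm.create_endpoint_config(
        EndpointConfigName=both_config,
        ProductionVariants=[
            dict(blue, InitialVariantWeight=1.0),
            dict(green, InitialVariantWeight=0.0),
        ],
    )
    # 2. Apply it to the live endpoint and wait for the update to finish.
    sm.update_endpoint(EndpointName=endpoint_name,
                       EndpointConfigName=both_config)
    waiter.wait(EndpointName=endpoint_name)

    if not validate():
        return False  # blue keeps serving all traffic

    # 3. Switch traffic to green.
    sm.update_endpoint_weights_and_capacities(
        EndpointName=endpoint_name,
        DesiredWeightsAndCapacities=[
            {"VariantName": blue["VariantName"], "DesiredWeight": 0.0},
            {"VariantName": green["VariantName"], "DesiredWeight": 1.0},
        ],
    )
    waiter.wait(EndpointName=endpoint_name)

    # 4. Configuration holding only green: blue's instances go away.
    sm.create_endpoint_config(
        EndpointConfigName=green_config,
        ProductionVariants=[dict(green, InitialVariantWeight=1.0)],
    )
    sm.update_endpoint(EndpointName=endpoint_name,
                       EndpointConfigName=green_config)
    waiter.wait(EndpointName=endpoint_name)
    return True
```

In real life, you would probably keep step 4 for later, since rolling back is only a weight update away as long as the blue variant is still attached to the endpoint.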

Canary testing

Canary testing lets you validate a new release with minimal risk by deploying it first for a fraction of your users: everyone else keeps using the previous version. This user split can be done in many ways: random, geolocation, specific user lists, etc. Once you’re satisfied with the release, you can gradually roll it out to all users.

This requires “stickiness”: for the duration of the test, designated users must be routed to servers running the new release. This could be achieved by setting a specific cookie for these users, allowing the web application to identify them and send their traffic to the proper servers.

You could implement this logic either in the application itself or in a dedicated web service. The latter would be in charge of receiving prediction requests and invoking the appropriate endpoint.

This feels like extra work, but chances are you’ll need a web service anyway for data pre-processing (normalization, injecting extra data in the prediction request, etc.) and post-processing (filtering prediction results, logging, etc.). Lambda feels like a good way to do this: easy to deploy, easy to scale, built-in high availability, etc. Here’s an example implemented with AWS Chalice.

This is what it would look like with two endpoints.

                Using a “switch” web service and two single-model endpoints for canary testing.
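Here is a minimal sketch of the routing decision inside such a "switch" service. Instead of the cookie mentioned above, it hashes the user ID into one of 100 buckets, which gives stickiness without storing anything per user. This is a different technique from the cookie approach, and the function and endpoint names are made up for the example.

```python
import hashlib


def bucket_for(user_id, buckets=100):
    """Deterministically map a user ID to a bucket in [0, buckets)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def endpoint_for(user_id, canary_percent, stable_endpoint, canary_endpoint):
    """Send users in the first `canary_percent` buckets to the canary."""
    if bucket_for(user_id) < canary_percent:
        return canary_endpoint
    return stable_endpoint
```

Since a user always lands in the same bucket, raising `canary_percent` only adds users to the canary group: those who were already on the new release stay there.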


Once we’re happy that the new model works, we can gradually roll it out to all users, scaling endpoints up and down accordingly.

                                               Gradually switching all users to the new models.

A/B testing

A/B testing is about comparing the performance of different versions of the same feature while monitoring a high-level metric (e.g. click-through rate, conversion rate, etc.). In this context, this would mean predicting with different models for different users and analysing results.

Technically speaking, A/B testing is similar to canary testing with larger user groups and a longer time-scale (days or even weeks). Stickiness is essential and the technique mentioned above would certainly work: building user buckets, sticking them to different endpoints and logging results.
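As an illustration of the "logging results" part, here is a small sketch computing a conversion rate per user bucket from logged events. The event format and the sample log are invented for the example: a real test would read from your own logs and would also need a proper statistical analysis before drawing conclusions.

```python
from collections import defaultdict


def conversion_rates(events):
    """events: iterable of (bucket, converted) pairs, e.g. ("A", True)."""
    shown = defaultdict(int)
    converted = defaultdict(int)
    for bucket, did_convert in events:
        shown[bucket] += 1
        if did_convert:
            converted[bucket] += 1
    return {bucket: converted[bucket] / shown[bucket] for bucket in shown}


# Made-up log entries, for illustration only.
log = [("A", True), ("A", False), ("A", False), ("A", False),
       ("B", True), ("B", True), ("B", False), ("B", False)]
rates = conversion_rates(log)  # A: 0.25, B: 0.5
```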

As you can see, the ability to deploy multiple models to the same endpoint is an important requirement for validation and testing. Let’s see how this works.

Deploying multiple models to the same endpoint

Imagine we’d like to compare different models trained with the built-in algorithm for image classification using different hyper-parameters.

These are the steps we need to take (full notebook available on Gitlab):

  1. train model A.
  2. train model B.
  3. create models A and B, i.e. register them in SageMaker.
  4. create an endpoint configuration with the two production variants.
  5. create the endpoint.
  6. send traffic and look at CloudWatch metrics.

We’ve trained this algo in a previous post, so I won’t go into details: in a nutshell, we’re simply training two models with different learning rates.

Creating the endpoint configuration

This is where we define our two production variants: one for model A and one for model B. To begin with, we’ll assign them equal weights, in order to balance traffic 50/50. We’ll also use identical instance configurations.

# `time` is assumed to be imported and `sagemaker` to be a boto3
# SageMaker client created earlier in the notebook.
job_name_prefix = 'DEMO-imageclassification'
timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
endpoint_config_name = job_name_prefix + '-epc-' + timestamp
endpoint_config_response = sagemaker.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    # ProductionVariants=[...] for models A and B: listing truncated here
)
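For reference, here is a sketch of what a complete configuration with two equally weighted production variants could look like. The helper, the variant names, the model names and the `ml.m4.xlarge` instance type are assumptions for this example, not the values used in the notebook.

```python
def two_variant_config(config_name, model_a, model_b,
                       instance_type="ml.m4.xlarge", instance_count=1):
    """Build keyword arguments for create_endpoint_config."""
    def variant(name, model_name):
        return {
            "VariantName": name,
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": instance_count,
            "InitialVariantWeight": 1.0,  # equal weights: 50/50 split
        }
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [
            variant("variant-a", model_a),
            variant("variant-b", model_b),
        ],
    }


# Hypothetical configuration and model names.
params = two_variant_config("demo-epc", "model-a", "model-b")
# With a boto3 SageMaker client: client.create_endpoint_config(**params)
```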

Creating the endpoint

Pretty straightforward: all it takes is calling the CreateEndpoint API, which builds all infrastructure required to support the production variants defined in the endpoint configuration.

timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
endpoint_name = job_name_prefix + '-ep-' + timestamp
print('Endpoint name: {}'.format(endpoint_name))

# Attach the endpoint configuration created above to the new endpoint
endpoint_params = {
    'EndpointName': endpoint_name,
    'EndpointConfigName': endpoint_config_name,
}
endpoint_response = sagemaker.create_endpoint(**endpoint_params)


After a few minutes, we can see the endpoint settings in the SageMaker console.
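If you’d rather wait in code than watch the console, a simple polling loop on the DescribeEndpoint API does the job. This is a sketch only: `sm` is assumed to behave like a boto3 SageMaker client, and the polling interval and retry limit are arbitrary defaults.

```python
import time


def wait_for_endpoint(sm, endpoint_name, poll_seconds=30,
                      max_polls=60, sleep=time.sleep):
    """Poll DescribeEndpoint until the endpoint is InService."""
    for _ in range(max_polls):
        response = sm.describe_endpoint(EndpointName=endpoint_name)
        status = response["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError("endpoint failed: " + endpoint_name)
        sleep(poll_seconds)  # still Creating or Updating
    raise TimeoutError("endpoint not in service: " + endpoint_name)
```

boto3 also ships a ready-made `endpoint_in_service` waiter, which is probably the better choice outside of a demo.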

Monitoring traffic

Let’s send some traffic and monitor the endpoint in CloudWatch. After a few more minutes, we can see that traffic is nicely balanced between the two production variants.
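You can also check the split from the client side, because the InvokeEndpoint response tells you which production variant served each request. Here is a sketch, assuming `runtime` behaves like a boto3 `sagemaker-runtime` client and `payload` holds the bytes of a test image. The function name and request count are mine.

```python
from collections import Counter


def count_variants(runtime, endpoint_name, payload, requests=100):
    """Invoke the endpoint repeatedly and tally which variant answered."""
    served_by = Counter()
    for _ in range(requests):
        response = runtime.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="application/x-image",
            Body=payload,
        )
        served_by[response["InvokedProductionVariant"]] += 1
    return served_by
```

With equal weights and enough requests, the two counters should end up close to each other.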


Let’s update the weights in the AWS console: model A now gets 10% of traffic and model B gets 90%. As mentioned above, you could also do this programmatically with the UpdateEndpointWeightsAndCapacities API.
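For the programmatic route, here is a sketch of the request for a 10/90 split. The helper, the endpoint name and the variant names are hypothetical: use the ones defined in your own endpoint configuration.

```python
def weight_update(endpoint_name, weights):
    """Build keyword arguments for update_endpoint_weights_and_capacities.

    `weights` maps variant names to their desired weights.
    """
    return {
        "EndpointName": endpoint_name,
        "DesiredWeightsAndCapacities": [
            {"VariantName": name, "DesiredWeight": float(weight)}
            for name, weight in weights.items()
        ],
    }


# Hypothetical endpoint and variant names; 10/90 split as in the text.
params = weight_update("demo-ep", {"variant-a": 10, "variant-b": 90})
# With a boto3 SageMaker client:
# client.update_endpoint_weights_and_capacities(**params)
```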


Almost immediately, we see most of the traffic now going to model B.

Wrapping up

As you can see, it’s pretty easy to manage multiple models on the same prediction endpoint. This lets us use different techniques to safely test new models before deploying them with minimal risk to client applications 🙂

That’s it for today. Thank you for reading. As always, please feel free to ask your questions here or on Twitter.


Original Source

Julien Simon


Julien focuses on helping developers and organizations bring their ideas to life. He frequently speaks at conferences and he also blogs at https://medium.com/@julsimon. Prior to joining AWS, Julien served for 10 years as CTO/VP Engineering in top-tier web startups where he led large Software and Ops teams in charge of thousands of servers worldwide. In the process, he fought his way through a wide range of technical, business and procurement issues, which helped him gain a deep understanding of physical infrastructure, its limitations and how cloud computing can help. Last but not least, Julien holds all eight AWS certifications.