Powering Millions of Real-Time Decisions with LyftLearn Serving

Editor’s note: Hakan Baba and Mihir Mathur are speakers for ODSC East 2023 this May. Be sure to check out their talk, “Powering Millions of Real-time Decisions with Distributed Model Serving,” there!

A model serving system spans two planes:
  • Data Plane: Encompasses steady-state entities such as network traffic, CPU/memory consumption, and model inference.
  • Control Plane: Encompasses moving parts such as (un)deployment, model retraining, model naming and versioning, experimentation, backward compatibility, etc.

Two broad challenges shaped the design of LyftLearn Serving:
  1. Variety of user requirements: Different teams care about different system requirements, such as extremely tight latency limits (single-digit milliseconds), high throughput (>10⁶ RPS), the ability to use niche ML libraries, support for continual learning, etc. This leads to a vast operating environment that is technically challenging to create and maintain.
  2. Constraints imposed by our legacy system: We had a monolithic service that was already in use for serving models. While it met some of the technical requirements, it also imposed several constraints. For instance, the monolithic design restricted the libraries and versions that could be used for different models, which led to operational problems such as unrelated teams blocking each other from deploying and unclear ownership during incidents.

LyftLearn Serving Requirements. The bar width represents the rough span of the model serving requirements.

LyftLearn Serving — Major Components & Considerations

Microservice Architecture


LyftLearn Serving Microservice and its relationship with other tooling

  • HTTP Serving Library: The HTTP server interface is mostly powered by Flask. We have some internal fine-tuning on top of open-source Flask to optimize for the Envoy load balancer and the underlying Gunicorn web server (a minimal sketch of such an endpoint follows this list).
  • Core LyftLearn Serving Library: This library is the crux of the business logic of LyftLearn Serving, housing the various capabilities needed by the customers of the ML platform. This library contains logic for model (un)loading, model versioning, request handling, model shadowing, model monitoring, prediction logging, etc.
  • Custom ML/Predict Code: This is a flexible Python interface, fulfilled by ML modelers, that lets them inject arbitrary code into the LyftLearn Serving runtime. The interface surfaces functions such as load and predict, described in more detail in a following section.
  • Third-Party ML Library: The majority of ML models use third-party modeling frameworks such as TensorFlow, PyTorch, LightGBM, XGBoost, or a proprietary framework. LyftLearn Serving does not impose any restriction on the framework as long as there is a Python interface for it.
  • Other components offered by the Lyft microservices architecture: The LyftLearn Serving runtime implements additional interfaces powering metrics, logs, tracing, analytics events, and model monitoring. It sits on top of Lyft’s compute infrastructure, which uses the Envoy service mesh and the Kubernetes Scheduler.
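
To make the HTTP serving layer concrete, below is a minimal sketch of a Flask inference endpoint of the kind described above, assuming a Gunicorn-managed worker behind Envoy. The route name, payload shape, and in-process model registry are illustrative assumptions, not LyftLearn Serving's actual implementation.

import flask

app = flask.Flask(__name__)

# Hypothetical in-process registry of loaded models, keyed by model_id.
# In LyftLearn Serving this role is played by the core serving library.
MODEL_REGISTRY = {}


@app.route("/infer", methods=["POST"])
def infer():
    payload = flask.request.get_json()
    model = MODEL_REGISTRY[payload["model_id"]]
    # Delegate to the user-supplied predict code (see the interfaces section below).
    output = model.predict(payload["features"])
    return flask.jsonify({"output": output})


# In production, Gunicorn workers would run this app behind the Envoy mesh, e.g.:
#   gunicorn app:app --workers 4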

Ownership & Isolation


Isolated Components

Config Generator

LyftLearn Serving Config Generator responsibilities

Config Generator for Creating Service Repositories

Model Self-Tests

import pandas as pd

# TrainableModel is the LyftLearn base class that user model code subclasses.
class SampleNeuralNetworkModel(TrainableModel):
    @property
    def test_data(self) -> pd.DataFrame:
        # Each row pairs a sample model input with its expected score.
        return pd.DataFrame(
            [
                # input `[1, 0, 0]` should generate output close to `[1]`
                [[1, 0, 0], 1],
                [[1, 1, 0], 1],
            ],
            columns=["input", "score"],
        )
  1. At runtime in LyftLearn Serving: After loading every model, the system evaluates test_data and generates logs and metrics for the ML modelers so they can address any failures (a simplified sketch of this check follows this list).
  2. Anytime a new PR is created: CI evaluates all models loaded in a LyftLearn model repo against the previously stored test data.
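
As an illustration of the runtime check, a loaded model's test_data could be evaluated roughly as sketched below. The evaluate_self_test helper, the numeric tolerance, and the logging calls are assumptions made for exposition and are not LyftLearn Serving internals.

import logging

import pandas as pd

logger = logging.getLogger(__name__)


def evaluate_self_test(model, tolerance: float = 0.1) -> bool:
    """Run a loaded model against its own test_data and log any mismatches."""
    test_df: pd.DataFrame = model.test_data
    passed = True
    for _, row in test_df.iterrows():
        prediction = model.predict(row["input"])
        if abs(prediction - row["score"]) > tolerance:
            logger.warning(
                "Self-test failed: input=%s expected=%s got=%s",
                row["input"], row["score"], prediction,
            )
            passed = False
    return passed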

LyftLearn Serving Interfaces

def load(self, file: str) -> Any:
    <CUSTOM LOADING CODE HERE>

def predict(self, features: Any) -> Any:
    <CUSTOM PREDICT CODE HERE>
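
For concreteness, a filled-in version of these hooks on a user's model class might look like the following sketch. The choice of XGBoost, the artifact format, and the flat numeric feature payload are illustrative assumptions; any framework with a Python interface can be plugged in the same way.

from typing import Any

import pandas as pd
import xgboost as xgb


class SampleGbmModel:
    def load(self, file: str) -> Any:
        # Deserialize the trained model from the artifact path provided
        # by the serving runtime.
        booster = xgb.Booster()
        booster.load_model(file)
        self.model = booster
        return booster

    def predict(self, features: Any) -> Any:
        # Convert the incoming feature payload (assumed here to be a flat dict
        # of numeric values) into the shape the model expects.
        frame = pd.DataFrame([features])
        return float(self.model.predict(xgb.DMatrix(frame))[0])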

Lifetime of an Inference Request


How an Inference Request is handled by LyftLearn Serving

POST /infer
{
  "model_id": "driver_model_v2",
  "features": {
    "feature1": "someValue",
    "feature3": {
      "a": "a",
      "b": 4.9
    }
  }
}

{
  "output": 9.2
}
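
A caller could exercise this endpoint as in the short sketch below; the host name and the use of the requests library are illustrative assumptions rather than a documented LyftLearn Serving client.

import requests

# Hypothetical service host; real callers reach the service through the Envoy mesh.
response = requests.post(
    "http://driver-model-serving.internal/infer",
    json={
        "model_id": "driver_model_v2",
        "features": {"feature1": "someValue", "feature3": {"a": "a", "b": 4.9}},
    },
    timeout=0.1,  # tight latency budgets make short client timeouts typical
)
print(response.json()["output"])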

Development Flow With LyftLearn Serving


Interfaces for Modifying the LyftLearn Serving Runtime

Summary & Learnings

In this post, we covered the following aspects of LyftLearn Serving:
  • Model serving as a library
  • Distributed serving service ownership
  • Seamless integrations with development environment
  • User-supplied prediction code
  • First-class documentation

Some of the key learnings from building and operating the system:
  • Define the term “model”. “Model” can refer to a wide variety of things (e.g. the source code, the collection of weights, files in S3, the model binary, etc.), so it’s important to carefully define and document what “model” refers to at the start of almost every conversation. Having a canonical set of definitions in the ML community for all of these different notions of “models” would be immensely helpful.
  • Supply user-facing documentation. For platform products, thorough, clear documentation is critical for adoption. Great documentation leads to teams understanding the systems and self-onboarding effectively, which reduces the platform teams’ support overhead.
  • Expect served models to be used indefinitely. Once a model is serving inference requests behind a network endpoint, it’s likely to be used indefinitely. Therefore, it is important to ensure that the serving system is stable and performs well. Moreover, migrating old models to a new serving system can be incredibly challenging.
  • Prepare to make hard trade-offs. We faced many trade-offs, such as building a “Seamless end-to-end UX for new customers” vs. “Composable Intermediary APIs for power customers”, or enabling “Bespoke ML workflows for each team” vs. enforcing “Rigor of software engineering best practices”. We made case-by-case decisions based on user behavior and feedback.
  • Make sure your vision is aligned with the power customers. It’s important to align the vision for a new system with the needs of power customers. In our case that meant prioritizing stability, performance, and flexibility above all else. Don’t be afraid to use boring technology.

What’s Next?

Originally published on eng.lyft.com. Reposted with permission.
