Build and Deploy Multiple Large Language Models in Kubernetes with LangChain

Editor’s note: Ezequiel Lanza is a speaker for ODSC East this April 23-25. Be sure to check out his talk, “LangChain on Kubernetes: Cloud-Native LLM Deployment Made Easy & Efficient,” there!

Deploying large language model (LLM) architectures with billions of parameters is like preparing your dinner with an extensive list of ingredients—each component adds something important to the final product and you need to get the mix just right to get the results you want. Crafting generative AI interfaces is already challenging; you have tools like Hugging Face, LangChain, PyTorch, TensorFlow, and many more at your disposal, each offering its own set of capabilities and intricacies with no one recipe to guide you. And, with everyone from IT to marketing wanting a piece of the AI pie, the pressure is on to get results that taste good right out of the oven, so to speak.  

Imagine you’re planning to implement an internal chatbot capable of responding to HR, IT, and legal department inquiries. Initially, you must determine which LLM to utilize. If you try to rely on one model to handle all queries, for example, you might find that a specialized HR bot isn’t equipped to handle legal issues effectively or that a generic model isn’t able to answer questions about internal documentation. 

This prompts consideration of an architecture utilizing multiple LLMs to meet various departmental needs, requiring a mix of fine-tuned, generic, or externally sourced models, all accessible through a unified user interface. Managing this approach involves balancing computational requirements, optimizing resource utilization, and navigating potential pitfalls. Just as preparing a fine meal requires patience and understanding, deploying LLM architectures demands similar attention to detail, and a deep understanding of the components involved.



Equipping the Cloud Native Kitchen

Similar to how a well-equipped kitchen is essential for cooking a great meal, deploying our multi-department bot requires finding the right infrastructure. You’ll start by deciding which LLM to use—a local or an external model? 

In cases where local regulations or privacy concerns dictate the use of a local model, you’ll want to avoid relying on external APIs to safeguard data integrity. However, deploying LLMs locally presents its own set of challenges. For instance, consider the memory demands of a model like LLaMa2-7B-chat-hf, which can require approximately 26GB of RAM. You’ll want to optimize the model to reduce its footprint from, say, 26GB to 7GB, which you can do with tools like Intel Extension for Transformers. This optimization is essential for ensuring efficient deployment and operation, especially when transitioning from development in a Jupyter notebook to a real-life environment.
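To see where those numbers come from, here is a rough back-of-envelope estimate. This is a sketch only: real memory footprints also include activations, the KV cache, and framework overhead.

```python
# Rough memory estimate for a 7-billion-parameter model at different precisions.
# Ignores activations, KV cache, and framework overhead.
PARAMS = 7_000_000_000  # ~7B parameters


def model_size_gb(bytes_per_param: float) -> float:
    """Approximate weight-storage size in GiB."""
    return PARAMS * bytes_per_param / 1024**3


fp32_gb = model_size_gb(4)  # full precision (4 bytes/param): ~26 GiB
int8_gb = model_size_gb(1)  # 8-bit quantized (1 byte/param): ~6.5 GiB

print(f"fp32: {fp32_gb:.1f} GiB, int8: {int8_gb:.1f} GiB")
```

Quantizing from 32-bit to 8-bit weights is what takes the model from the ~26GB range down to the single digits, which is the difference between needing a large dedicated node and fitting comfortably in a container.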

In this transition, you should now consider factors such as the traffic that each of our multi-model bots will receive. This includes determining the appropriate size of the environment in which each model will operate, how it will adapt to fluctuations in the traffic, and how it will scale to accommodate changes, just to mention some factors.

Cloud-native architecture works well in this scenario, as it isolates each part of the application independently. Typically, Kubernetes serves as the orchestrator for deploying the application. This approach enables seamless scalability through automatic resource adjustment, ensuring optimal performance regardless of demand fluctuations. Additionally, modularization simplifies management, allowing for easier monitoring and minimizing the risk of system-wide failures. For multi-model architecture, like our case, this modularity offers granular control, preventing failures in one container from affecting others and maintaining overall system stability.


LangChain Makes it Easy

LangChain simplifies communication between those isolated entities, in our case the model for each department. Since each model is independent and runs in its own container, we need to establish a communication method, particularly with the front end, which receives user questions and forwards them to the appropriate model. LangChain facilitates this by offering a unified way to interact with each model, whether it’s consumed via an external API, run locally, or wrapped in a more advanced strategy like retrieval augmented generation (RAG).

A simple method within LangChain, “invoke,” runs a chain that combines the model with the prompt the user inputs. This functionality extends beyond mere code unification; it also facilitates container exposure. For instance, through LangServe’s “add_routes” feature, integrated with FastAPI, we can selectively expose methods such as “invoke” via an external API.

Putting Together the Architecture

Let’s outline the architecture by breaking it down into three main components:

  • Front-end: This component serves as the interface through which users interact with the application. There are various open-source front-end solutions available, such as Gradio and Streamlit, or you can use a great React open-source toolkit: react-chatbot-kit
  • Frontend_LLMs: This is a front-end module responsible for forwarding requests to each LLM backend. When utilizing local models, each LLM container needs to be instantiated separately; this approach provides better control over resources by treating each container as a distinct entity.
  • Back-end LLMs: This component hosts the individual LLMs. Each LLM container receives incoming requests and responds by sending the processed data back to the Frontend_LLMs module. For external models, the communication process is similar, with the only difference being the endpoint location.

Image 1: A reference architecture for multiple LLMs
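The routing step in the Frontend_LLMs module can be sketched as a simple mapping from department to backend endpoint. The Kubernetes service names and the `/invoke` path below are hypothetical, chosen to match the LangServe convention described above.

```python
# Hypothetical Frontend_LLMs routing table: department -> in-cluster endpoint.
# Service names (llm-hr, llm-it, llm-legal) and paths are illustrative only.
DEPARTMENT_ENDPOINTS = {
    "hr": "http://llm-hr:8000/hr/invoke",
    "it": "http://llm-it:8000/it/invoke",
    "legal": "http://llm-legal:8000/legal/invoke",
}


def route_question(department: str) -> str:
    """Return the backend endpoint for a department's LLM container."""
    endpoint = DEPARTMENT_ENDPOINTS.get(department.lower())
    if endpoint is None:
        raise ValueError(f"No LLM backend configured for: {department}")
    return endpoint


# The front end would then POST the user's question to this endpoint, e.g.
# requests.post(route_question("HR"), json={"input": question}).
```

Because each backend is a separate Kubernetes service, a failing legal-bot container leaves the HR and IT routes untouched, which is exactly the isolation benefit described above.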

It’s evident that deploying multiple LLMs in Kubernetes demands meticulous planning. Thankfully, tools like LangChain and Hugging Face streamline this process, making it more manageable.

The advantages of a cloud-native architecture are clear: scalability and enhanced management. With its modularity, elasticity, and fine-grained control, resource management becomes efficient, the system adapts seamlessly to fluctuating demand, and performance stays consistent across diverse language models. Ultimately, it empowers you to provide your users with a reliable service experience.

In my upcoming talk at ODSC East 2024, I’ll delve into a real-world example, providing a step-by-step guide on how to select a model, optimize it, and deploy the architecture in Kubernetes.

For more insightful content on open source and AI, be sure to check out Open.intel.

About the Author:

Ezequiel Lanza is an AI open source evangelist at Intel. He holds an M.S. in Data Science and is passionate about helping people discover the exciting world of artificial intelligence. Ezequiel is a frequent AI conference presenter and the creator of use cases, tutorials, and guides that help developers adopt open-source AI tools.

Cover Photo by Nadine Primeau on Unsplash

ODSC Community

The Open Data Science community is passionate and diverse, and we always welcome contributions from data science professionals! All of the articles under this profile are from our community, with individual authors mentioned in the text itself.