How to Use Guardrails to Design Safe and Trustworthy AI

If you’re serious about designing, building, or implementing AI, the concept of guardrails is probably something you’ve heard of. While the concept of guardrails to mitigate AI risks isn’t new, the recent wave of generative AI applications has made these discussions relevant for everyone—not just data engineers and academics.

As an AI builder, it’s critical to educate your stakeholders about the importance of guardrails. As an AI user, you should be asking your vendors the right questions to ensure guardrails are in place when designing ML models for your organization.

In this article, you’ll get a better understanding of what guardrails are and how to set them at each stage of AI design and development.

What Are Guardrails in AI?

Guardrails are the set of filters, rules, and tools that sit between inputs, the model, and outputs to reduce the likelihood of erroneous or toxic outputs and unexpected formats, while ensuring the system conforms to your expectations of values and correctness. You can loosely picture them in this diagram.

In short, guardrails are a way to keep a process in line with expectations. They allow us to build more security into our models and deliver more reliable results to the end user. Today, most discussion of guardrails centers on generative AI applications; however, many of the techniques apply to other AI applications as well.

Setting Guardrails Throughout AI Design

No matter the application, guardrails can be set at each point in the AI design and development process: in training, for prompts and inputs, and for outputs.

Guardrails During Training

While at ODSC, I heard an interesting quote from Rama Akkiraju, vice president of AI for IT at NVIDIA, that has stuck with me: “We used to get security from obscurity.”

In the past, enterprises have had documents and PDFs buried in drives with protected and sensitive information likely littered throughout. This information used to be secure because organizations weren’t consuming it at scale. Now all of a sudden we’re building language models that might, for example, require a full export of every single customer conversation that’s been had. Chances are, someone has given personal information like a phone number or (hopefully not, but you never know) a Social Security number.

If we export this data without first scanning and identifying what sources of sensitive information might be in the training data, we could pass that information forward to the model. Establishing guardrails for training data gives us the chance to separate any risky information from the initial data.

Unit testing is a well-understood concept in software development: you design a series of tests that make sure snippets of code, and any updates to them, continue to run as expected. The same idea applies to models, and just like standard unit tests, humans still have to come up with the scenarios and examples to test the models against. We’ve even started to see teams get clever by using large language models to generate additional examples for these tests.
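To make this concrete, here is a minimal sketch of behavior tests for a chatbot. The `chatbot_reply` function below is a hypothetical stub standing in for whatever inference call your system exposes; the scenarios and pass/fail predicates are the parts humans (or an LLM, with human review) would author.

```python
# Sketch of unit tests for model behavior. `chatbot_reply` is a placeholder
# for a real model call; replace it with your own inference function.
def chatbot_reply(prompt: str) -> str:
    # Stub standing in for a real model call.
    if "ssn" in prompt.lower():
        return "I can't share that information."
    return "Happy to help with your return."

TEST_CASES = [
    # (prompt, predicate the response must satisfy)
    ("What's the customer's SSN?", lambda r: "can't" in r),
    ("I'd like to return these shoes.", lambda r: "return" in r),
]

def run_behavior_tests() -> bool:
    """Run every scenario; return True only if all expectations hold."""
    return all(check(chatbot_reply(prompt)) for prompt, check in TEST_CASES)
```

In practice these checks would live in a test framework and run on every model update, just like standard unit tests run on every code change.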

Real-World Example: Let’s pretend you’re a retailer who wants to improve your returns process with a language model-enabled customer service chatbot. Before connecting it to customer purchase records, prior chat histories, and product information, you’ll want to obfuscate the training data. If you’re training the model on your customers and past interactions, ensure that real names or other PII (personally identifiable information) are not passed through to the model.
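As a rough sketch of what that pre-training scrub might look like, the snippet below replaces common PII patterns with typed placeholders. The regexes are illustrative, not exhaustive; production systems typically rely on dedicated PII-detection tooling rather than hand-written patterns.

```python
import re

# Illustrative PII patterns only -- real pipelines should use a dedicated
# PII-detection tool rather than hand-rolled regexes.
PII_PATTERNS = {
    "phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def redact_pii(text: str) -> str:
    """Replace detected PII with typed placeholders before training."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text
```

For instance, `redact_pii("Call 216-555-0134")` yields `"Call [PHONE]"`, so the model never sees the raw number.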


Guardrails For Prompts and Inputs

When it comes to setting guardrails for prompts and inputs, we can screen data to determine whether the data inputs are likely to make the model misbehave or operate out of known validated conditions.

This is especially important in light of prompt poisoning attempts, an emerging class of attack in which adversaries craft unusual inputs, such as odd sequences of tokens, to get LLMs to misbehave.

Through stress testing and fine-tuning, we can determine what kinds of inputs make the model behave unexpectedly. A simple approach is to mathematically calculate how similar (or different) a new prompt or input is to previously validated examples.
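Here is a minimal sketch of that similarity check. A real system would use learned embeddings from a model; a bag-of-words cosine similarity keeps this sketch dependency-free, and the prompts and threshold are assumed values for illustration.

```python
import math
from collections import Counter

# Previously validated prompts (illustrative examples).
KNOWN_GOOD = [
    "i want to return my order",
    "where is my refund",
    "can i exchange this item",
]

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def is_in_distribution(prompt: str, threshold: float = 0.3) -> bool:
    """Flag prompts that look nothing like any previously validated input."""
    vec = Counter(prompt.lower().split())
    return max(cosine(vec, Counter(p.split())) for p in KNOWN_GOOD) >= threshold
```

A prompt like "i want to return my shoes" scores high against the validated set and passes, while a string of unrelated tokens scores near zero and can be rejected or routed for review.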

Real-World Example: In our same customer service chatbot example, a customer may start a conversation with the chatbot, requesting to return a specific product. Input guardrails can help determine if the individual requesting information has the right to trigger the model and retrieve that information.

Guardrails For Outputs

These are the set of safeguards that apply to live solutions and sit between an AI model and the end user. When designing guardrails for outputs, determine what causes reputational harm or distrust in the model. It could be off-brand tone, nonfunctional results, biased or harmful language, toxicity, etc. Generally, we’re looking for a few different things at this point:

  • Does the output match the expected format, response length, or structure?
  • Is the result factually correct? Or in applications that may produce code, can the output actually run?
  • Does the output contain any harmful biases? Is the tone safe and appropriate for the intended audience?
  • Does the user have the right to access and know all information that is included in the outputs?

These guardrails are critical to prevent low-quality or potentially harmful results from making their way to a user. It’s better to default to ‘I can’t answer that’ or a set of pre-populated responses to direct further action than to provide incorrect output.
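The fallback behavior described above can be sketched as a simple validation layer between the model and the user. The checks, blocklist, and length limit below are illustrative placeholders; real output guardrails would combine many such rules with toxicity and factuality checks.

```python
# Illustrative output guardrail: validate the model's draft response and fall
# back to a safe default rather than ship a bad answer.
SAFE_FALLBACK = "I can't answer that. Let me connect you with a support agent."
BLOCKED_TERMS = {"ssn", "password"}  # placeholder blocklist
MAX_LENGTH = 500                     # assumed length limit

def guard_output(draft: str) -> str:
    if not draft or len(draft) > MAX_LENGTH:
        return SAFE_FALLBACK  # unexpected format or length
    if any(term in draft.lower() for term in BLOCKED_TERMS):
        return SAFE_FALLBACK  # potentially sensitive content
    return draft
```

The key design choice is that every failure mode routes to the same safe default, so a user never sees a raw, unvalidated model response.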

Real-World Example: When a customer asks your chatbot for a refund, is there a set rule on the maximum amount that someone can refund in a single transaction? This sort of rule is one instance of a guardrail on outputs. Another example would be to set a filter so that all outputs meet a certain level of positive sentiment to match your brand voice.
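A refund-cap rule like the one described is among the simplest guardrails to implement. The sketch below assumes a $100 single-transaction cap and a human-escalation path; both are hypothetical values for illustration.

```python
# Illustrative business-rule guardrail: cap the refund a chatbot can
# authorize in a single transaction. The cap is an assumed value.
MAX_SINGLE_REFUND = 100.00

def approve_refund(amount: float) -> dict:
    """Approve small refunds automatically; escalate anything over the cap."""
    if amount <= MAX_SINGLE_REFUND:
        return {"approved": True, "amount": amount}
    return {"approved": False,
            "reason": "Amount exceeds limit; escalating to a human agent."}
```

Because the rule sits outside the model, no prompt manipulation can talk the chatbot into exceeding the cap.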

Guardrails in AI may not be new, but now is the time to familiarize yourself with them. As an AI builder, how are you ensuring your ML models have the right filters and rules in place to avoid unintended consequences? And as an AI user, are you working with vendors you can trust to build a model with proper guardrails? Documenting these assumptions and clearly communicating them to end users builds invaluable trust in the model across users and stakeholders.

About the Author: Cal Al-Dhubaib is a globally recognized data scientist and AI strategist in trustworthy artificial intelligence, as well as the founder and CEO of Pandata, a Cleveland-based AI consultancy, design, and development firm.

ODSC Community

The Open Data Science community is passionate and diverse, and we always welcome contributions from data science professionals! All of the articles under this profile are from our community, with individual authors mentioned in the text itself.