The Three Essential Methods to Evaluate a New Language Model

New LLMs are released every week, and if you’re like me, you might ask yourself: Does this one finally fit all the use cases I want to utilise an LLM for? In this tutorial, I’ll introduce the three techniques I use regularly to evaluate new LLMs. None of them are new (in fact, I will refer to blog posts that I have written previously), but by bringing them all together, I save a significant amount of time whenever a new LLM is released. I will demonstrate each of them on the new OpenChat model.

Why is this important?

When it comes to new LLMs, it’s important to understand their capabilities and limitations. Unfortunately, figuring out how to deploy the model and then systematically testing it can be a bit of a drag. This process is often manual and can consume a lot of time. However, with a standardised approach, we can iterate much faster and quickly determine whether a model is worth investing more time in, or if we should discard it. So, let’s get started.

Getting Started

There are many ways to utilise an LLM, but when we distil the most common uses, they often pertain to open-ended tasks (e.g. generating text for a marketing ad), chatbot applications, and Retrieval Augmented Generation (RAG). Correspondingly, I employ relevant methods to test these capabilities in an LLM.

0. Deploying the model

Before we get started with the evaluation, we first need to deploy the model. I have boilerplate code ready for this, where we can just swap out the model ID and the instance type to deploy to (I’m using Amazon SageMaker for model hosting in this example) and we’re good to go:

import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
  role = sagemaker.get_execution_role()
except ValueError:
  iam = boto3.client('iam')
  role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

model_id = "openchat/openchat_8192"
instance_type = "ml.g5.12xlarge"  # 4 x 24GB VRAM
number_of_gpu = 4
health_check_timeout = 600  # how much time we allow for the model download

# Hub Model configuration. https://huggingface.co/models
hub = {
  'HF_MODEL_ID': model_id,
  'SM_NUM_GPUS': json.dumps(number_of_gpu),
  'MAX_INPUT_LENGTH': json.dumps(7000),  # Max length of input text
  'MAX_TOTAL_TOKENS': json.dumps(8192),  # Max length of the generation (including input text)
}

# create Hugging Face Model Class with the LLM inference container image
huggingface_model = HuggingFaceModel(
  image_uri=get_huggingface_llm_image_uri("huggingface"),
  env=hub,
  role=role,
)

model_name = model_id.split("/")[-1].replace(".", "-")
endpoint_name = model_name.replace("_", "-")

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  endpoint_name=endpoint_name,
  container_startup_health_check_timeout=health_check_timeout,
)

# send request
predictor.predict({
  "inputs": "Hi, my name is Heiko.",
})

It’s worth noting that we can utilise the new Hugging Face LLM Inference Container for SageMaker, as the new OpenChat model is based on the Llama architecture, which is supported in this container.
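Once the endpoint is up, every request bundles the prompt with generation parameters. A small helper keeps this repeatable across all the tests below; this is a sketch, and the parameter names follow the Hugging Face LLM container’s text-generation interface, with illustrative default values:

```python
def build_payload(prompt, max_new_tokens=256, temperature=0.7):
    """Build a request payload for the Hugging Face LLM inference container."""
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,  # cap on generated tokens
            "temperature": temperature,        # sampling temperature
            "do_sample": True,
        },
    }

# the payload can then be passed straight to the endpoint:
# predictor.predict(build_payload("Hi, my name is Heiko."))
```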

1. Playground

Using the notebook to test a few prompts can be cumbersome, and it may also discourage non-technical users from experimenting with the model. A much more effective way to familiarise yourself with the model, and to encourage others to do the same, is to build a playground. I have previously detailed how to easily create such a playground in this blog post, and with the code from that post we can get one up and running quickly.
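Stripped of its UI layer, a playground is just a thin wrapper around a single call to the endpoint. A minimal sketch of that core, with the inference call injected so the UI code can be tested without a live endpoint (the `[{"generated_text": ...}]` response shape is an assumption based on the Hugging Face LLM container, so verify it against your container version):

```python
def playground_query(prompt, predict_fn):
    """Send a prompt through an injected predict function and unwrap the reply."""
    response = predict_fn({"inputs": prompt})
    # the HF LLM container returns a list of {"generated_text": ...} dicts
    return response[0]["generated_text"]

# with a live endpoint this would be: playground_query(prompt, predictor.predict)
```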

Once the playground is established, we can introduce some prompts to evaluate the model’s responses. I prefer using open-ended prompts, where I pose a question that requires some degree of common sense to answer:

How can I improve my time management skills?

Image by author

What if the Suez Canal had never been constructed?

Image by author

Both responses appear promising, suggesting that it could be worthwhile to invest additional time and resources in testing the OpenChat model.


2. Chatbot

The second thing we want to explore is a model’s chatbot capabilities. Unlike the playground, where every prompt is stateless, we want to understand the model’s ability to “remember” context within a conversation. In this blog post, I described how to set up a chatbot using the Falcon model. It’s a simple plug-and-play operation, and by changing the SageMaker endpoint, we can point it at the new OpenChat model.
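Because the endpoint itself is stateless, the chatbot has to re-send the conversation on every turn. A minimal sketch of how the history can be flattened into a single prompt; the `User:`/`Assistant:` role labels are an assumption here, and the right template depends on how the model was fine-tuned:

```python
def build_chat_prompt(history, user_message):
    """Flatten (user, assistant) turns plus the new message into one prompt.

    history: list of (user_turn, assistant_turn) tuples from earlier in the chat.
    """
    lines = []
    for user_turn, assistant_turn in history:
        lines.append(f"User: {user_turn}")
        lines.append(f"Assistant: {assistant_turn}")
    lines.append(f"User: {user_message}")
    lines.append("Assistant:")  # cue the model to produce the next reply
    return "\n".join(lines)
```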

Let’s see how it fares:

Image by author

The performance as a chatbot is quite impressive. There was an instance, however, where OpenChat attempted to terminate the conversation abruptly, cutting off mid-sentence. This is not a rare occurrence, in fact. We usually don’t notice it with other chatbots because they employ specific stop words to compel the AI to cease text generation, so the issue surfacing in my app probably comes down to how stop words are handled in my application.
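One way to handle this client-side is to truncate the model’s output at a set of stop sequences, the way many chatbot front ends do. A minimal sketch, where the stop words themselves are illustrative and would need to match the model’s actual chat template:

```python
def truncate_at_stop_words(text, stop_words):
    """Cut the generation at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_words:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)  # keep only text before the earliest stop word
    return text[:cut].rstrip()

# truncate_at_stop_words("Sure!\nUser: next question", ["\nUser:"]) -> "Sure!"
```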

Beyond that, OpenChat has the capability to maintain context throughout a conversation, as well as to extract crucial information from a document. Impressive. 😊

3. Retrieval Augmented Generation (RAG)

The last task we want to test involves using LangChain for some RAG tasks. I’ve found that RAG tasks can be quite challenging for open source models, often requiring me to write my own prompts and custom response parsers to achieve functionality. However, what I’d like to see is a model that operates optimally “out of the box” for standard RAG tasks. This blog post provides a few examples of such tasks. Let’s examine how well it performs. The question we’ll be posing is:

Who is the prime minister of the UK? Where was he or she born? How far is their birth place from London?

Image by author

This is, without a doubt, the best performance I’ve seen from an open-source model using the standard prompt from LangChain. This is probably unsurprising, considering OpenChat has been fine-tuned on ChatGPT conversations, and LangChain is tailored towards OpenAI models, particularly ChatGPT. Nonetheless, the model was capable of retrieving all three facts accurately using the tools at its disposal. The only shortcoming was that, in the end, the model failed to recognise that it possessed all the necessary information and could answer the user’s question. Ideally, it should have stated, “I now have the final answer,” and provided the user with the facts it had gathered.
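If we wanted to work around that last shortcoming, one option is a lightweight check on the agent’s output that extracts the answer once it appears. A hedged sketch of such a detector; the `Final Answer:` marker mirrors LangChain’s default ReAct prompt, while the helper itself is hypothetical:

```python
def extract_final_answer(agent_output):
    """Return the text after the 'Final Answer:' marker, or None if absent."""
    marker = "Final Answer:"
    idx = agent_output.find(marker)
    if idx == -1:
        return None  # the agent has not concluded yet
    return agent_output[idx + len(marker):].strip()
```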

Image by author

In this blog post, I’ve introduced you to three standard evaluation techniques that I use all the time to evaluate LLMs. We’ve observed that the new OpenChat model performs exceptionally well on all these tasks. Surprisingly, it appears very promising as the underlying LLM for a RAG application, probably just requiring customised prompting to discern when it has arrived at the final answer.

It’s noteworthy that this isn’t a comprehensive evaluation, nor is it intended to be. Instead, it offers an indication of whether a particular model is worth investing more time in and conducting further, more intensive testing. It seems that OpenChat is definitely worth spending time on 🤗

Feel free to use all the tools, expand and customise them, and start evaluating the LLMs that pique your interest within minutes.

Article originally posted here. Reposted with permission.

ODSC Community

The Open Data Science community is passionate and diverse, and we always welcome contributions from data science professionals! All of the articles under this profile are from our community, with individual authors mentioned in the text itself.