Making Your LLM More Friendly With Fine-Tuning

Editor’s note: Ben is a speaker for ODSC East this April 23-25. Be sure to check out his talk, “Ben Needs a Friend – An intro to building Large Language Model applications,” there!

As part of my upcoming ODSC tutorial on Large Language Model (LLM) development, I’ve been writing a series diving deep into the topics I’ll be covering. One (pretty big) topic I cover is fine-tuning. In this post, I consolidated a few parts of the series into a bite-sized dive into fine-tuning. If you want even more, I encourage you to check out my series or come to my tutorial this Spring!

LLMs, despite all the media hype these days, aren’t exactly new. In fact, the “Language Model” part of LLMs goes back (depending on how you count) to ancient times (i.e. the 1940s). The idea is that the probability of the next word in a passage depends on the words that came before it, so the probability of the whole passage is a product of these next-word probabilities. This concept is at the root of modern LLMs; they’re trained to predict a word given its context.
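
In notation, this is just the chain-rule factorization of a sequence’s probability (standard language-modeling math, not anything specific to this post):

$$P(w_1, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1})$$

Training pushes the model to assign high probability to each observed word given the words that precede it.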

Basically, auto-complete. Attach some deep learning dark magic to this concept and you get this:

This is essentially where we were with the previous era of generative models. Large, complex models were trained on massive text corpora to predict the next word. The result was a sensible continuation of input text, which makes for a cute toy but isn’t very useful.

Enter — fine-tuning. The main innovation that separates the pre-ChatGPT era from the post-ChatGPT era is a second stage of training that “tunes” the model to output relevant responses based on instructions. A simple example:

Instruction: “What are the three primary colors?”

Output: “The three primary colors are red, blue, and yellow.”

This process is called “instruction tuning,” and the results speak for themselves. A 2022 paper from Google researchers showed the difference between their PaLM model trained without instruction tuning and with it (Flan-PaLM):

A model like the one that powers ChatGPT has (probably¹) undergone several rounds of fine-tuning to make it the world-ending technology it is today. These rounds fall into (roughly) the following categories:

  1. Pre-training — The model is trained on a general text corpus to predict the next word
  2. Supervised Fine Tuning (SFT) — The model is trained to produce a certain output given an input

Not included here is Reinforcement Learning from Human Feedback (RLHF). I talk about that in another post, and I won’t be doing it here. Since it’s also a supervised form of fine-tuning, it does sort of fit into that second category.

But the result is by no means perfect. One of the most popular applications of LLMs is code generation. For code generation benchmarks, often the most performant models are fine-tuned specifically on code². So, depending on your use case, you may want to undertake your own fine-tuning.

My use case for my upcoming ODSC tutorial is to make myself an LLM friend. I was actually inspired to take this on by a walkthrough I came across (linked at the end of this post). And who writes better friends than…Friends?

How to tune friends and influence models

My goal here is to make my LLM talk like a character on the TV show Friends. Why? Because I’m on the internet and I get what I want.

To do this, I needed text in the target style. Luckily, Kaggle has a dataset of Friends scripts. I assume it exists for exactly this purpose.

There were two ways I could use these data to accomplish my goal:

(A) Train a model on all dialogue

In this case, I ask the model to predict the next word of all character dialogue. It learns the Friends “tone”, and therefore generates more Friends script-like text. This is similar to the pre-training objective (1) described above.

(B) Train a model on paired exchanges

Here, I’d give the model pairs of lines: exchanges between characters. The model would learn to generate responses in the style of a Friends character. This is similar to the SFT objective (2) described above.
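
To make the contrast concrete, here’s a rough sketch of what a single training example could look like under each approach (the specific lines are illustrative, not pulled from the actual training data):

```python
# Approach A: a raw chunk of dialogue, used for plain next-word prediction
example_a = (
    "Joey: Hey, why can't we use the same toothbrush, but we can use the same soap?\n"
    "Chandler: Because soap is soap. It's self-cleaning."
)

# Approach B: a paired exchange, used as an (input, output) example for SFT
example_b = {
    "input": "Joey: Hey, why can't we use the same toothbrush, but we can use the same soap?",
    "output": "Chandler: Because soap is soap. It's self-cleaning.",
}
```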

I actually ended up trying both. I’ll start with B since it was fairly straightforward to implement with OpenAI’s APIs. Then I’ll move on to A using all open-source and free utilities!

Friends-GPT — Supervised Fine Tuning

A small company you may have heard of, OpenAI, provides a useful set of tools to fine-tune their proprietary LLMs. Their documentation seems to revolve around SFT, so that’s what I used it for.

In this case, we’re looking at training approach B — training on pairs of character dialogue. Check out this notebook to follow along. Note: if you want to try this you’ll need your own OpenAI API key. You’ll also need to pay for API usage.

First, I developed a system prompt, which gave the model instructions on how to generate a response to the input:

Your name is Friend. You are having a conversation with your close friend Ben. You and Ben are sarcastic and poke fun at one another. But you care about each other and support one another. You will be presented with something Ben said. Respond as Friend.

My input prompt was:

What are we doing tonight?

I ran this through the base GPT 3.5 model to see the result:

Friend: Oh, I thought you were planning a surprise party for me. Should I cancel the inflatable unicorn bouncy castle I ordered?

Generally, GPT 3.5 does not make a good friend. It’s kind of heavy on the sarcasm. It also sometimes generates “Friend:” as a prefix, but not consistently.
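
For reference, the base-model call looks roughly like this with the OpenAI Python client (the client setup and model name here are my assumptions, not copied from the notebook):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

system_prompt = (
    "Your name is Friend. You are having a conversation with your close friend Ben. "
    "You and Ben are sarcastic and poke fun at one another. But you care about each "
    "other and support one another. You will be presented with something Ben said. "
    "Respond as Friend."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # assumed base model
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "What are we doing tonight?"},
    ],
)
print(response.choices[0].message.content)
```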

For fine-tuning, I formatted the Friends scripts to include the system prompt above and an input prompt as a line from one of the main Friends characters. The output for training was the subsequent line (from another character). My thinking is that these paired lines model the behavior I want the bot to imitate.
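
Each training example then follows OpenAI’s chat fine-tuning format: a system message, a user message (one character’s line), and an assistant message (the next character’s line). A sketch of one example (whether speaker names are kept in the text is a formatting choice, and these lines are just illustrative):

```python
# One training example in the chat fine-tuning format (stored as JSONL, one object per line)
training_example = {
    "messages": [
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": "Joey: Hey, why can't we use the same toothbrush, but we can use the same soap?",
        },
        {
            "role": "assistant",
            "content": "Chandler: Because soap is soap. It's self-cleaning.",
        },
    ]
}
```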

For fine-tuning, the documentation says that 50–100 examples should result in improvement. To save time and money, I went with the low end, providing the fine-tuning job with 50 examples. After several epochs of training, I received my shiny new Friends-GPT.
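
Kicking off the job is then just a file upload plus a fine-tuning call, along these lines (again a sketch; the file name, the `training_examples` list, and the base model are assumptions on my part):

```python
import json

# Write the ~50 examples (dicts like the one above) to a JSONL file
with open("friends_pairs.jsonl", "w") as f:
    for example in training_examples:  # assumed list of formatted examples
        f.write(json.dumps(example) + "\n")

# Upload the training file, then start the fine-tuning job
training_file = client.files.create(
    file=open("friends_pairs.jsonl", "rb"),
    purpose="fine-tune",
)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",  # assumed base model for tuning
)
print(job.id)  # poll this job until it finishes, then use the resulting model name
```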

Provided with the input prompt above, Friends-GPT output:

Well, I kinda had plans to stay in tonight. Throw a few punches, do a little fight clubbin’.

Friends-GPT definitely has a tone that’s more in line with a Friends episode, but it does generate some odd stuff. And it’s really interested in Ross, arguably the worst character:

Well, we could go to the coffeehouse and..(notices Ross hasn’t left) or… I could stick my hand in the mashed potatoes again.

One issue here is that I don’t really know how the fine-tuning process works. OpenAI keeps its proprietary wall pretty high, which makes it difficult to debug strange outcomes like this one. It provides you with a metric for measuring performance, but very little else.

So for our next experiment, let’s do it ourselves. Since I (and probably you) don’t have OpenAI money, that means we’ll have to figure out our own compute.

OpenAI HQ, I assume

And that means we’ll be looking for some optimizations.

The inevitable maths — Quantization and Low-Rank Adapters (LoRA)

One thing you might be wondering is how I’m going to manage tuning 7 billion parameters. The answer is…I’m not.

Previous eras of fine-tuning involved a paltry few hundred million parameters. Tuning an entire LLM introduces a slew of new challenges even if you have access to GPUs.

One of the two optimizations I’ll be using is 8-bit quantization of the model parameters. Huggingface has a great article explaining it in detail. The light-touch explanation is that the individual model parameters are represented with lower precision. If you round a value like 16.5 to 17, you’re essentially performing quantization: representing the number at lower precision. This isn’t exactly how it happens under the hood of the method we’re applying, but it’s the same concept. As you can imagine, lower precision means the model performs slightly worse, but the compute required is greatly reduced.
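
As a rough sketch, loading a model in 8-bit with the Hugging Face transformers and bitsandbytes libraries looks something like this (the model ID is my assumption; the base model used later in this post is Llama-2 7B):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # assumed Hugging Face repo for the base model

# Quantize the weights to 8-bit at load time to cut GPU memory use
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across whatever GPUs are available
)
```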

To train the model, I will be using a Low-Rank Adapter (LoRA). Let’s break that acronym down: An “adapter” is a layer that is introduced into the transformer architecture that can be trained along with or instead of the other transformer layers (more detail). “Instead of” is important here — it means that the other layers can be frozen while just the adapter layers are trained. Rather than updating many billions of parameters, you’re just updating the adapter parameters.

Pictures, thousand words, etc:

Traditional fine-tuning vs. adapter fine-tuning. Note that “weights” refers to the parameters being updated during training.

“Vanilla” adapters are basically new layers being added to the model. This means they increase the size of the model (number of parameters). Even though training is more efficient, prediction (inference) is slower.

LoRA tries to solve that problem by directly connecting to the existing model parameters. It takes a layer’s input and learns a low-rank approximation of the update for that layer. Instead of that big 𝛥W matrix, it learns a simplified version that contains approximately the same information as the full update. The size of this simplification is governed by a parameter r, the rank of the decomposition.

No new layers — it just adds directly to the layer to which it is attached.

Note: the “W” matrices in A and B don’t necessarily have the same dimensions, even though this picture makes it look that way.
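
In equation form, this is the standard LoRA decomposition (general LoRA math, not anything specific to this notebook). For a frozen weight matrix $W \in \mathbb{R}^{d \times k}$, the update is approximated by two small matrices:

$$W' = W + \Delta W \approx W + BA, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)$$

During training, only $A$ and $B$ are updated; $W$ itself stays frozen.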

Basically, the LoRA breaks down what needs to change about the model in simpler terms. Think of it like the Obama anger translator.
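
With the peft library, attaching LoRA adapters to the quantized model is only a few lines. Here’s a sketch (the target modules and hyperparameter values below are typical choices, not necessarily the ones used in the notebook):

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Make the 8-bit model trainable (casts norm layers, enables gradient checkpointing, etc.)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank decomposition
    lora_alpha=16,                        # scaling applied to the adapter output
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt (assumed)
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of parameters are trainable
```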


Whew — how are we doing? Ready for some code?

The Friends-ly model

For this I’ll be using a Kaggle notebook. They’re much more generous with their free GPUs.

The first lesson I learned was one I should already know by now: data matters. My goal here was to make the model speak like a Friends character. To that end, I wanted to train the model similarly to how it was pre-trained: on minimally processed text. Compare that to the instruction tuning we performed with our Friends-GPT.

But what is minimally processed text?! I ended up trying three different datasets. Those results are available in the full post. But the most sensible results came from providing the model with chunks of dialogue:

Joey: Hey, why can’t we use the same toothbrush, but we can use the same soap?

Chandler: Because soap is soap. It’s self-cleaning.

Joey: Alright, well next time you take a shower, think about the last thing I wash and the first thing you wash.
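
To produce chunks like the one above, the Kaggle scripts can be grouped into runs of consecutive lines. A rough sketch (the CSV file name and column names are assumptions about the dataset, so check them against what you actually download):

```python
import pandas as pd

# Assumed layout: one row per spoken line, with "speaker" and "text" columns
scripts = pd.read_csv("friends_scripts.csv")

lines = [f"{row.speaker}: {row.text}" for row in scripts.itertuples()]

chunk_size = 3  # lines of dialogue per training chunk (arbitrary choice)
chunks = ["\n".join(lines[i : i + chunk_size]) for i in range(0, len(lines), chunk_size)]

print(chunks[0])  # each chunk is then tokenized and used for next-word-prediction training
```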

I used the same prompt as above:

Your name is Friend. You are having a conversation with your close friend Ben. You and Ben are sarcastic and poke fun at one another. But you care about each other and support one another. You will be presented with something Ben said. Respond as Friend.

Ben: What should we do tonight?

Base model (Llama-2, 7B parameters)

Friend: 🤔😓 Oh boy, do we have to? 😅 Can’t we just stay in and watch Netflix? 📺😴

Friends-ly Model (3 epochs of training)

Friend: Oh, I don’t know, what are you up for?

Ross: (entering) Hey Ben!

Monica: Hi!

Ross: Hey!

Monica: So, you ready to get married?

It does generate some strange stuff, but it definitely resembles a Friends episode. That resemblance may have come at the cost of the model being particularly useful, though:

Prompt

Tell me about alpacas.

Base model

Alpacas are domesticated animals that are native to South America. They are members of the camel family and are known for their soft, luxurious fleece and gentle nature.

Friends-ly model

Chandler: Oh, they’re just so cute!

Phoebe: Oh, you’ve seen them?

Chandler: Yeah, I’m all about the alpacas.

I also kind of love this one:

Chandler: They’re like little ponies.

Chandler: (to Joey) I’m sorry. (Joey is upset) Oh, I’m sorry. I didn’t mean to hurt you.

Ending thoughts

Spending a week training a high-powered ML model to sound like a Chandler Bing will get you thinking. Reflecting on the results, I think the instruction-tuning approach might have been the more relevant one for this use case.

What I wanted was a chatbot that could talk like a Friends character in conversation. For that purpose, setting up the input/output pairing as a conversation (as was the case in the instruction tuning) seems like the best approach. Using minimally processed text just enabled the model to generate more of it, not necessarily converse.

And, honestly, if making an AI-generated ’90s show was my goal…that’s sort of been done.

If you liked this, definitely check out the rest of my series on LLM development. Or, if you really liked it, join me at ODSC this April!

Play us out, Friends-Llama:

Ben: Hey gang, how do you like this blog post? (shows them the blog post)

Ross: I don’t know, I’m not very tech savvy.

Phoebe: (reading the blog post) Ooh, this is so cool! (reading the comments) Wow, there are so many people who don’t like it.

The adapter diagrams were inspired by and adapted from these diagrams from Lightning AI.

I took inspiration from this walkthrough, which builds a full application around this. Really interesting read!

If you’re interested in a deeper dive into these topics, there are some excellent blog posts on adapters and LLM training more generally.

  1. OpenAI’s models are not open source and they’re pretty cagey about what’s going on under the hood of ChatGPT. So we can’t really say much about how they currently work or will work in the future.
  2. As of this writing (Jan 2024) — the highest performer on the HumanEval benchmark is GPT-4. Related to the note above — it’s actually not clear how GPT-4 works and earlier versions of OpenAI’s APIs used a specially trained model called Codex.

Ben Batorsky is a Data Science consultant working with multiple sectors on building AI strategies and applications. Previously, he led Data Science teams in academia (Northeastern University, MIT) and industry (ThriveHive). He obtained his Masters in Public Health (MPH) from Johns Hopkins and his PhD in Policy Analysis from the Pardee RAND Graduate School. Since 2014, he has been working in data science for government, academia and industry. His major focus has been on Natural Language Processing (NLP) technology and applications. Throughout his career, he has pursued opportunities to contribute to the larger data science community. He has presented his work at conferences, published articles, taught courses in data science and NLP, and is co-organizer of the Boston chapter of PyData. He also contributes to volunteer projects applying data science tools for public good.
