Fine-tuning LLMs on Slack Messages
Editor’s note: Eli Chen is a speaker for ODSC West this October 30th to November 2nd. Be sure to check out his talk, “Fine-tuning LLMs on Slack Messages,” there!

Fine-tuning LLMs is super easy, thanks to HuggingFace’s libraries. This tutorial walks you through adapting a pre-trained model to generate text that emulates chat conversations from a Slack channel. You should be comfortable with Python to get the most out of this guide.

Getting the data

First, obtain the data from Slack using the API. Before diving in, make sure you have your bot token, user ID, and channel ID handy. If you’re unsure how to acquire these, here’s a quick guide.

Initialize the Slack WebClient and define a function, fetch_messages, to pull specific chats filtered by a user ID.

from slack_sdk import WebClient
from slack_sdk.errors import SlackApiError

token = "YOUR_SLACK_BOT_TOKEN"
channel_id = "CHANNEL_ID"
user_id = "USER_ID"
client = WebClient(token=token)

def fetch_messages(channel_id):
    messages = []
    cursor = None
    while True:
        try:
            response = client.conversations_history(channel=channel_id, cursor=cursor)
            assert response["ok"]
            for message in response['messages']:
                if 'user' in message and message['user'] == user_id:
                    messages.append(message['text'])
            cursor = response.get('response_metadata',{}).get('next_cursor')
            if not cursor:
                break
        except SlackApiError as e:
            print(f"Error: {e.response['error']}")
            break
    return messages
all_messages = fetch_messages(channel_id)

The function fetch_messages returns the text of every message the specified user posted in the given channel. The result is stored in all_messages and serves as the training data for fine-tuning.
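
The cursor loop inside fetch_messages is a general pagination pattern, and it can help to see it in isolation. The sketch below is a minimal illustration, not part of the tutorial's code: fetch_all, fake_fetch_page, and the PAGES data are hypothetical names standing in for the Slack API, and the fake source signals the last page with an empty cursor, the same stop condition fetch_messages checks with "if not cursor".

```python
def fetch_all(fetch_page):
    """Collect items from a cursor-paginated source.

    fetch_page is a hypothetical callable that takes a cursor (None for the
    first page) and returns a tuple (items, next_cursor).
    """
    items = []
    cursor = None
    while True:
        page_items, cursor = fetch_page(cursor)
        items.extend(page_items)
        # An empty or missing cursor means there are no more pages
        if not cursor:
            break
    return items


# Made-up three-page source standing in for the Slack API
PAGES = {
    None: (["msg 1", "msg 2"], "page-2"),
    "page-2": (["msg 3"], "page-3"),
    "page-3": (["msg 4"], ""),
}


def fake_fetch_page(cursor):
    return PAGES[cursor]


print(fetch_all(fake_fetch_page))
```

Keeping the loop separate from the API call like this also makes it easy to test the pagination logic without a Slack token.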

Fine-Tuning the Model

After collecting the messages, the next step is to fine-tune a pre-trained language model to mimic the specific language patterns of this particular user. The code below utilizes HuggingFace’s transformers library to streamline this process.

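from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Initialize the tokenizer and model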
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # Set padding token to eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Tokenize the strings and create a Dataset object
tokens = tokenizer(all_messages, padding=True, truncation=True, return_tensors="pt")
dataset = Dataset.from_dict(tokens)

# Create DataCollator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)

# Setup training arguments
training_args = TrainingArguments(
    output_dir="./output",
    learning_rate=2e-4,
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

# Fine-tuning
trainer.train()

By running this code, you fine-tune a GPT-2 model to generate text mimicking the user’s messages. Feel free to experiment with different models and learning rates to better fit your needs.

Testing Your Model

After fine-tuning, you’ll want to test the model’s ability to mimic your user’s messages. The code below shows how to generate text based on the prompt “Hello “.

# Generate text
input_text = "Hello "
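# Move the prompt to the model's device (the Trainer may have placed the model on a GPU)
input_ids = tokenizer.encode(input_text, return_tensors="pt").to(model.device)

# Generate a text sample; do_sample=True is required for temperature to take effect
output = model.generate(
    input_ids,
    max_length=50,
    num_return_sequences=1,
    do_sample=True,
    temperature=1.0,
    pad_token_id=tokenizer.eos_token_id,
)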

# Decode and print the text
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
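
The temperature argument passed to generate rescales the model's scores before a token is sampled, and it only matters when sampling is enabled. The sketch below shows the effect in plain Python. It is an illustration under stated assumptions rather than the library's implementation: softmax_with_temperature is a hypothetical helper and the logits are made up.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing by a temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate tokens
logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Temperatures below 1 concentrate probability on the highest-scoring token, which makes output more predictable. Temperatures above 1 flatten the distribution and make output more varied.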

For more rigorous evaluations, consider adding performance metrics such as BLEU score or perplexity.
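
As a rough illustration of the perplexity metric, the sketch below computes it from per-token log-probabilities: perplexity is the exponential of the average negative log-likelihood. The perplexity function and the example values are hypothetical and chosen only to make the arithmetic easy to follow. In practice the log-probabilities would come from the fine-tuned model scored on held-out messages.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    # Average negative log-likelihood per token
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Made-up example: the model assigns probability 0.25 to each of four tokens,
# so the result is approximately 4
example_log_probs = [math.log(0.25)] * 4
print(perplexity(example_log_probs))
```

Lower is better: a perplexity near 4 means the model is, on average, about as uncertain as if it were choosing uniformly among four tokens. With a transformers causal language model, passing labels equal to the input IDs returns the mean per-token negative log-likelihood as the loss, so exponentiating that loss gives the same quantity.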

Conclusion

To conclude, you’ve walked through the basic steps for fine-tuning a pre-trained language model on a user’s Slack messages. While this serves as an introduction, there are numerous paths for enhancement, including incremental downloading, hyperparameter tuning, developing per-user conversational models, and incorporating more comprehensive evaluation methods for bias.

For a deeper dive, join me at my ODSC West talk. I’ll discuss real-world training experiences, interesting and weird behaviors we observed over a year, and share insights on associated risks and mitigation strategies.

About the author

Eli is the CTO and Co-Founder at Credo AI. He has led teams building secure and scalable software at companies like Netflix and Twitter. Eli has a passion for unraveling how things work and debugging hard problems. Whether it’s using cryptography to secure software systems or designing distributed system architecture, he is always excited to learn and tackle new challenges. Eli graduated with an Electrical Engineering and Computer Science degree from U.C. Berkeley.
