Harnessing LLM Alignment: Making AI More Accessible

Editor’s note: Sinan Ozdemir is a speaker for ODSC West this October 30th to November 2nd. Be sure to check out his talk, “Aligning Open-source LLMs Using Reinforcement Learning from Feedback,” there!

Back in 2020, the world was introduced to OpenAI’s GPT-3, a marvel in the AI domain to many. However, it wasn’t until two years later, in 2022, when OpenAI unveiled its instruction-aligned version of GPT-3, aptly named “InstructGPT,” that its full potential came into the spotlight, and the world started really paying attention. That innovation wasn’t just a technological leap for AI alignment; it was a demonstration of the power of reinforcement learning to make AI more accessible to everyone.

Aligning Our Expectations

Alignment, broadly defined, is the process of making an AI system that behaves in accordance with what a human wants. Alignment isn’t just about training AI to follow instructions; it’s about designing a system to sculpt an already powerful AI model into something more usable and beneficial to both technically inclined users and to someone who just needs help planning a birthday party. It’s this very aspect of alignment that has democratized the magic of Large Language Models (LLMs), enabling a broader audience to extract value from them.

If alignment is the heart of LLMs’ usability, what keeps this heart pumping? That’s where the intricate dance of Reinforcement Learning (RL) comes into play. While the term ‘alignment’ might be synonymous with reinforcement learning for some, there’s a lot more under the hood. Capturing the multifaceted dimensions of human emotions, ethics, or humor within the confines of next-token prediction is a colossal – and potentially impossible – task. How do you effectively program ‘neutrality’ or ‘ethical behavior’ into a loss function? Arguably, you can’t. It’s here that RL rises as a dynamic way to model these intricate nuances without strictly encoding them.

RLHF, which stands for Reinforcement Learning from Human Feedback, is the technique OpenAI originally used to align their InstructGPT model. It is frequently discussed among AI enthusiasts as the main way to align LLMs, but it’s merely one tool among many. The core principle of RLHF is to collect high-quality human feedback and use it to score the LLM’s task performance, in the hope that the AI speaks in a more user-friendly manner by the end of the loop.

In our own day-to-day work with LLMs, however, we often don’t need the AI to answer everything; we need it to solve the tasks relevant to us, our businesses, or our projects. In our journey with RL, we’ll explore alternatives to RLHF that use other feedback mechanisms and do not rely on human preferences.

Case Study – Aligning FLAN-T5 to make more neutral summaries

Let’s look at an example of using two classifiers from Hugging Face to enhance the FLAN-T5 model’s ability to write summaries of news articles that are both grammatically polished and consistently neutral in style.

The code below defines one such reward signal: it uses a pre-fine-tuned sentiment classifier to obtain the logit for the neutral class, rewarding FLAN-T5 for speaking in a neutral tone and punishing it otherwise:

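from transformers import pipeline

# Sentiment classifier used as the neutrality reward model
# (model name taken from the loop description below)
sentiment_pipeline = pipeline(
  'text-classification',
  'cardiffnlp/twitter-roberta-base-sentiment'
)

def get_neutral_scores(texts):
  scores = []
  # function_to_apply='none' returns logits which can be negative
  results = sentiment_pipeline(texts, function_to_apply='none', top_k=None)
  for result in results:
    for label in result:
      if label['label'] == 'LABEL_1': # logit for neutral class
        scores.append(label['score'])
  return scores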

>> get_neutral_scores(['hello', 'I love you!', 'I hate you']) 

>> [0.85, -0.75, -0.57]

We can pair this classifier with a second one that judges a text’s grammatical correctness, and use both to align our FLAN-T5 model to generate summaries the way we want.
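The grammar classifier's scoring function is not shown in this post. As a rough sketch only, a CoLA counterpart to get_neutral_scores could look like the following. The helper scores_for_label, the cola_pipeline parameter, and the choice of LABEL_1 as the "acceptable" class are assumptions rather than details from the article, so check the model card before relying on that label.

```python
from typing import Dict, List

# Shape returned by a Hugging Face text-classification pipeline called with
# top_k=None on a list of texts: one list of {"label", "score"} dicts per text.
PipelineOutput = List[List[Dict[str, object]]]


def scores_for_label(results: PipelineOutput, target_label: str) -> List[float]:
    """Return, for each text, the score attached to `target_label`."""
    scores = []
    for per_text in results:
        by_label = {entry["label"]: entry["score"] for entry in per_text}
        if target_label not in by_label:
            raise KeyError(f"label {target_label!r} missing from {sorted(by_label)}")
        scores.append(float(by_label[target_label]))
    return scores


def get_cola_scores(texts, cola_pipeline, acceptable_label="LABEL_1"):
    """Hypothetical CoLA counterpart of get_neutral_scores.

    `cola_pipeline` is any callable with the text-classification pipeline
    interface; `acceptable_label` is an assumption to verify on the model card.
    """
    results = cola_pipeline(texts, function_to_apply="none", top_k=None)
    return scores_for_label(results, acceptable_label)
```

Passing the pipeline in as an argument keeps the function easy to exercise with a stub before any model is downloaded.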

The Reinforcement Learning from Feedback loop looks something like this:

  1. Give FLAN-T5 a batch of news articles to summarize (taken from https://huggingface.co/datasets/argilla/news-summary only using the raw articles)
  2. Assign a weighted sum of rewards from:
    1. A CoLA model (judging grammatical correctness) from textattack/roberta-base-CoLA
    2. A sentiment model (judging neutrality) from cardiffnlp/twitter-roberta-base-sentiment 
  3. Use the rewards to update the FLAN-T5 model using the TRL package, taking into account how far the updated model has deviated from the original parameters
  4. Rinse and repeat
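Step 2 above can be sketched as a small helper. This is only an illustration: the function name combine_rewards and the equal default weights are hypothetical, since the post does not say which weights were used.

```python
from typing import List, Sequence


def combine_rewards(
    cola_scores: Sequence[float],
    neutral_scores: Sequence[float],
    cola_weight: float = 0.5,      # hypothetical weight, not from the article
    neutral_weight: float = 0.5,   # hypothetical weight, not from the article
) -> List[float]:
    """Weighted sum of the two classifier scores, one reward per response."""
    if len(cola_scores) != len(neutral_scores):
        raise ValueError("each response needs exactly one score from each classifier")
    return [
        cola_weight * cola + neutral_weight * neutral
        for cola, neutral in zip(cola_scores, neutral_scores)
    ]
```

The result is a plain list of floats. The training loop in this post calls r.cpu() on each reward, so the values would presumably be converted to tensors before being passed to ppo_trainer.step.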

Here is a sample of the training loop we will build at our workshop:

for epoch in tqdm(range(2)):
  for batch in tqdm(ppo_trainer.dataloader):
    #### prepend the summarize token
    game_data["query"] = ['summarize: ' + b for b in batch["text"]]

    #### get response from reference + current flan-t5
    input_tensors = [_.squeeze() for _ in batch["input_ids"]]
    # ....
    for query in input_tensors:
      response = ppo_trainer.generate(query.squeeze(), **generation_kwargs)
      # ....

    #### Reward system
    game_data["response"] = [flan_t5_tokenizer.decode(...)]  # rest of expression omitted
    game_data['cola_scores'] = get_cola_scores(...)  # arguments omitted
    game_data['neutral_scores'] = get_neutral_scores(...)  # arguments omitted
    # ....

    #### Run PPO training and log stats
    stats = ppo_trainer.step(input_tensors, response_tensors, rewards)
    stats['env/reward'] = np.mean([r.cpu().numpy() for r in rewards])
    ppo_trainer.log_stats(stats, game_data, rewards)

I omitted several lines of this loop to save space, but you can of course come to my workshop to see the loop in its entirety!
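The PPO step is where the "how far has the model deviated" consideration from the loop description comes in. PPO-based RLHF setups commonly handle this by subtracting a per-token penalty based on the gap between the current policy's log-probabilities and the frozen reference model's, then adding the classifier reward on the final token. The sketch below illustrates that general idea in plain Python. It is not TRL's actual implementation, and the function name and the beta value are hypothetical.

```python
from typing import List, Sequence


def kl_penalized_rewards(
    score: float,
    policy_logprobs: Sequence[float],
    reference_logprobs: Sequence[float],
    beta: float = 0.2,  # hypothetical penalty strength, not from the article
) -> List[float]:
    """Per-token rewards: a deviation penalty everywhere, plus `score` on the last token."""
    if not policy_logprobs or len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("need one log-probability per generated token from each model")
    # Tokens the updated model likes more than the reference model are penalized.
    rewards = [
        -beta * (policy - reference)
        for policy, reference in zip(policy_logprobs, reference_logprobs)
    ]
    # The classifier-based reward is credited once, at the end of the response.
    rewards[-1] += score
    return rewards
```

With identical log-probabilities the penalty vanishes and only the classifier reward remains, which is why an untouched model starts with no deviation cost.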

The Results

After a few epochs of training, our FLAN-T5 starts to show signs of enhanced alignment towards our goal of more grammatically correct and neutral summaries. Here’s a sample of what the different summaries look like using the validation data from the dataset:

A sample of FLAN-T5 before and after RL. We can see the RL fine-tuned version of the model is using words like “announced” over terms like “scrapped”.

Running both our models (the unaligned base FLAN-T5 and our aligned version) over the entire validation set shows an increase (albeit a subtle one) in both rewards from our CoLA model and our sentiment model!
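One simple way to quantify such a comparison, assuming you already have one reward score per validation summary from each model, is to look at the difference in means. The function name mean_reward_shift is hypothetical, and no actual results from this experiment are reproduced here.

```python
from statistics import mean
from typing import Sequence


def mean_reward_shift(base_scores: Sequence[float], aligned_scores: Sequence[float]) -> float:
    """Average reward of the aligned model minus average reward of the base model."""
    if not base_scores or not aligned_scores:
        raise ValueError("both score lists must be non-empty")
    return mean(aligned_scores) - mean(base_scores)
```

A positive value means the aligned model earns more reward on average. Running it once per classifier (CoLA and sentiment) keeps the two signals separate.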

The model is garnering increased rewards from our system, and upon inspection, there’s a nuanced shift in its summary generation. However, its core summarization abilities remain largely consistent with the base model.


Alignment isn’t just about the tools or methodologies of collecting data and making LLMs answer any and all questions. It’s also about understanding what we actually want from our LLMs. The goal of alignment, however, remains unwavering: fashion LLMs whose outputs resonate with human sensibilities, making AI not just a tool for the engineer but a companion for all. Whether you’re an AI enthusiast or someone looking to dip your toes into this world, there’s something here for everyone. Join us at ODSC this year as we traverse the landscape of LLM alignment together!

About the Author/ODSC West Speaker:

Sinan Ozdemir is a mathematician, data scientist, NLP expert, lecturer, and accomplished author. He is currently applying his extensive knowledge and experience in AI and Large Language Models (LLMs) as the founder and CTO of LoopGenius, transforming the way entrepreneurs and startups market their products and services.

Simultaneously, he is providing advisory services in AI and LLMs to Tola Capital, an innovative investment firm. He has also worked as an AI author for Addison Wesley and Pearson, crafting comprehensive resources that help professionals navigate the complex field of AI and LLMs.

Previously, he served as the Director of Data Science at Directly, where his work significantly influenced their strategic direction. As an official member of the Forbes Technology Council from 2017 to 2021, he shared his insights on AI, machine learning, NLP, and business processes related to emerging technologies.

