In a new report, UC Berkeley researchers have introduced Starling-7B, a large language model trained with Reinforcement Learning from AI Feedback (RLAIF). The researchers hope the model will show how AI-generated feedback can be used to make chatbots more helpful.
The researchers point out that at the core of Starling-7B lies Nectar, a ranking dataset labeled by GPT-4. The dataset contains 183,000 chat prompts. Each prompt comes with seven responses from various models such as GPT-4, GPT-3.5-instruct, GPT-3.5-turbo, Mistral-7B-Instruct, and Llama2-7B.
According to the report, Nectar yields a total of 3.8 million pairwise comparisons. To keep the rankings fair, the researchers took care to address positional bias when prompting GPT-4, a process detailed in the dataset section of the report.
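The 3.8 million figure is consistent with the numbers above: seven ranked responses contain 21 distinct pairs, and 183,000 prompts times 21 pairs is about 3.84 million. The short Python sketch below checks that arithmetic and shows one way a best-to-worst ranking could be expanded into (preferred, rejected) pairs. The function name and the placeholder labels are illustrative assumptions, not taken from the report.

```python
from itertools import combinations
from math import comb

PROMPTS = 183_000          # prompt count quoted in the article
RESPONSES_PER_PROMPT = 7   # responses per prompt quoted in the article


def ranking_to_pairs(ranked):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs.

    combinations() keeps the input order, so the first item of each
    pair is always ranked above the second. Illustrative helper only.
    """
    return list(combinations(ranked, 2))


# Number of distinct pairs among seven responses, and the dataset total.
pairs_per_prompt = comb(RESPONSES_PER_PROMPT, 2)
total_pairs = PROMPTS * pairs_per_prompt

# Hypothetical ranking for one prompt, best first (placeholder labels).
example_ranking = ["A", "B", "C", "D", "E", "F", "G"]
example_pairs = ranking_to_pairs(example_ranking)

print(pairs_per_prompt)   # pairs per prompt
print(total_pairs)        # pairs across the whole dataset
print(example_pairs[:3])  # first few (preferred, rejected) pairs
```

Pairs of this kind are the usual input for training a reward model on preference data, though the article does not describe the exact training objective the researchers used.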
Using a new reward model, the researchers fine-tuned the Openchat 3.5 language model, with impressive results. The AlpacaEval score rose from 88.51% to 91.99%, and the MT-Bench score rose from 7.81 to 8.09. Both are key metrics for gauging how useful a chatbot is.
The report also compares Starling-7B with open-source models trained using Direct Preference Optimization (DPO), such as Zephyr-7B, Neural-Chat-7B, and Tulu-2-DPO-70B. Those DPO models perform strongly in Chatbot Arena. However, they fall short of top SFT models like OpenHermes 2.5 and Openchat 3.5 on MT-Bench.
Despite its merits, Starling-7B faces challenges. It proves vulnerable to deceptive methods, struggles with mathematical and reasoning tasks, and occasionally delivers outputs with questionable factual accuracy.
Recognizing these limitations, the researchers are looking to refine Starling-7B by incorporating rule-based reward models, guided by the GPT-4 techniques outlined in the technical report. Even so, Starling-7B appears to represent a significant step forward for open large language models.
It showcases the potential of Reinforcement Learning from AI Feedback, and shows how collaboration between various models and shared community knowledge can advance the field of natural language processing.
Currently, the dataset, model, and online demo for Starling-7B are released as a research preview, licensed for non-commercial use only.