Google AI researchers have released a paper proposing a new approach called Pairwise Ranking Prompting, or PRP for short. The goal is to ease the difficulties that Large Language Models face in solving text ranking problems. LLMs, such as GPT-3 and PaLM, have demonstrated remarkable performance on natural language tasks, even in zero-shot settings.
But when it comes to text ranking, existing methods tend to fall short compared to trained baseline rankers, with the exception of black box systems like GPT-4. In the paper, the team acknowledges the value of black box systems, but they also emphasize the constraints faced by academic researchers, including cost and access limitations.
So in their study, they delve into the reasons why LLMs struggle with ranking problems under the current pointwise and listwise approaches. The team found that pointwise techniques depend on calibrated prediction probabilities, which prove exceedingly challenging for LLMs to generate.
Listwise techniques, on the other hand, result in inconsistent or irrelevant outputs, indicating a lack of ranking awareness in current LLM pre-training and fine-tuning techniques. To compensate for these limitations and reduce issues related to task complexity, the researchers proposed the PRP paradigm.
This method uses a simple prompt architecture, with a query and a pair of documents serving as the prompt for ranking tasks. Unlike existing methods, PRP supports both generation and scoring LLM APIs by default, and it addresses the calibration issue. The paper also discusses several PRP variations aimed at keeping the method both efficient and effective.
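To make the pairwise idea concrete, here is a minimal Python sketch. It is not the authors' implementation: the prompt wording, the toy_compare function (a word-overlap stand-in for a real LLM call), and the win-counting aggregation in rank_all_pairs are all assumptions chosen for illustration. Each pair of documents is compared in both orders, which is one simple way to reduce sensitivity to where a document appears in the prompt.

```python
from itertools import permutations
from typing import Callable, List

# Illustrative prompt wording (an assumption, not quoted from the paper).
PROMPT_TEMPLATE = (
    'Given a query "{query}", which of the following two passages is more '
    "relevant to the query?\n\n"
    "Passage A: {doc_a}\n\n"
    "Passage B: {doc_b}\n\n"
    "Output Passage A or Passage B:"
)


def build_prompt(query: str, doc_a: str, doc_b: str) -> str:
    """Fill the pairwise template with one query and two candidate documents."""
    return PROMPT_TEMPLATE.format(query=query, doc_a=doc_a, doc_b=doc_b)


def toy_compare(query: str, doc_a: str, doc_b: str) -> str:
    """Hypothetical stand-in for an LLM call.

    Real code would send build_prompt(query, doc_a, doc_b) to a model and
    parse its answer. Here the passage sharing more words with the query wins,
    and ties go to Passage A (mimicking a position-biased model).
    """
    query_words = set(query.lower().split())
    overlap_a = len(query_words & set(doc_a.lower().split()))
    overlap_b = len(query_words & set(doc_b.lower().split()))
    return "A" if overlap_a >= overlap_b else "B"


def rank_all_pairs(
    query: str,
    docs: List[str],
    compare: Callable[[str, str, str], str],
) -> List[str]:
    """Rank documents by counting pairwise wins over every ordered pair."""
    wins = [0] * len(docs)
    # permutations yields (i, j) and (j, i), so each pair is judged in both orders.
    for i, j in permutations(range(len(docs)), 2):
        if compare(query, docs[i], docs[j]) == "A":
            wins[i] += 1
        else:
            wins[j] += 1
    # sorted() is stable, so documents with equal wins keep their input order.
    order = sorted(range(len(docs)), key=lambda k: -wins[k])
    return [docs[k] for k in order]


docs = [
    "a recipe for tomato soup",
    "pairwise ranking prompting for llms",
    "ranking documents with bm25",
]
ranked = rank_all_pairs("pairwise ranking prompting", docs, toy_compare)
print(ranked[0])  # pairwise ranking prompting for llms
```

Comparing every ordered pair costs a number of model calls that grows quadratically with the number of documents, which is the kind of cost the efficiency-oriented variations would need to bring down.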
They went on to evaluate PRP using moderate-sized, open-sourced LLMs on traditional benchmark datasets. The results were strong: PRP surpassed previous methods based on the black box commercial GPT-4, a significantly larger model.
One example is the TREC-DL2020 dataset, where PRP based on the 20B parameter FLAN-UL2 model achieved an improvement of more than 5% at NDCG@1 over the prior best method. On TREC-DL2019, PRP outperformed existing solutions such as InstructGPT by over 10% on most ranking measures, while showing slightly lower performance than GPT-4 on the NDCG@5 and NDCG@10 metrics.
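For readers unfamiliar with the metric, NDCG@k scores how well the top k positions of a ranking match the ideal ordering, with 1.0 meaning a perfect top k. The Python sketch below uses the common exponential-gain formulation; this is an assumption, since the article does not say which NDCG variant the evaluation used, and the function names are illustrative.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions.

    relevances[i] is the graded relevance of the document at rank i + 1.
    Uses gain 2^rel - 1 and discount log2(rank + 1), one common convention.
    """
    return sum(
        (2 ** rel - 1) / math.log2(position + 2)
        for position, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    """DCG@k divided by the DCG@k of the ideal (descending) ordering.

    Assumes `relevances` covers every judged document for the query, so that
    sorting it gives the ideal ranking. Returns 0.0 if nothing is relevant.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# A ranking that puts the most relevant document first is perfect at k = 1.
print(ndcg_at_k([3, 2, 1], k=1))  # 1.0
```

Because NDCG@1 looks only at the top result, an improvement there reflects how often the single best document is ranked first.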
Overall, PRP exhibits several advantages, including its support for both scoring and generation LLM APIs and its insensitivity to input order. This work presents three major contributions. First, it demonstrates effective zero-shot ranking using moderate-sized, open-sourced LLMs. Second, it achieves state-of-the-art ranking performance through straightforward prompting and scoring mechanisms.
And finally, it explores efficiency enhancements while maintaining good empirical performance.