In the ever-evolving landscape of artificial intelligence, even the most advanced LLMs, including GPT-4 and PaLM 2, face challenges when it comes to solving complex mathematical problems. A recent study by researchers from Google and Yale hopes to shed light on how LLMs can overcome these hurdles and significantly improve their arithmetic problem-solving capabilities.
The study, conducted with the PaLM 2 model in both its small (PaLM 2-S) and large (PaLM 2-L) forms, reveals intriguing insights into the potential of LLMs. Initially, the research showcases that the models exhibit a higher probability of discovering accurate answers when allowed to tackle a problem multiple times.
For example, the pre-trained PaLM 2-L achieves 33.4% accuracy with greedy decoding, but the study emphasizes that this performance can be further enhanced: when 64 solutions are sampled with temperature sampling, at least one of them is correct a staggering 79.4% of the time (pass@64).
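The gap between greedy accuracy and pass@k can be measured with a simple check: for each problem, draw k samples and count the problems where at least one sampled answer matches the reference. A minimal sketch, assuming answers are compared as normalized strings (the sample data below is purely illustrative):

```python
def pass_at_k(sampled_answers, references):
    """Fraction of problems where at least one of the k sampled
    answers matches the reference answer."""
    hits = sum(
        any(ans == ref for ans in samples)
        for samples, ref in zip(sampled_answers, references)
    )
    return hits / len(references)

# Toy example: 3 problems, 4 samples each (k = 4).
samples = [
    ["42", "41", "42", "40"],    # contains the correct answer
    ["7", "8", "9", "10"],       # misses it
    ["3.5", "3.5", "2", "3.5"],  # contains it
]
refs = ["42", "6", "3.5"]
print(pass_at_k(samples, refs))  # 2 of 3 problems solved
```

In practice the comparison step would need to test mathematical equivalence (e.g. "0.5" vs. "1/2") rather than exact string equality.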
This discrepancy highlights the LLMs' ability to generate accurate solutions while struggling to discern correct answers from erroneous ones. To bridge this performance gap, the researchers explore three fine-tuning techniques:
- Supervised Step-by-Step Solution Fine-Tuning (SSFT): The study investigates whether pre-trained LLMs can benefit from a supervised fine-tuning step, which serves as a baseline: the models are fine-tuned to generate complete step-by-step solutions and final answers.
- Solution-Cluster Reranking (SCR): This technique fine-tunes the generator as a solution evaluator for reranking candidate solutions. The researchers introduce a novel method that combines the advantages of majority voting with reranking, efficiently grouping candidate solutions into clusters based on mathematical equivalence.
- Sequential Multi-tasking Fine-Tuning: Beyond solution assessment, the study delves into enhancing LLMs’ performance in solution generation. By framing the solution assessment task as a natural language generation problem, the researchers aim to leverage it as valuable supervision for the solution generation model, adjusting the model in three stages.
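The SCR idea of combining majority voting with reranking can be made concrete with a minimal sketch, not the paper's implementation: sampled solutions are grouped into clusters by their final answer, only the most common clusters are kept (the majority-vote shortlist), and an evaluator score then picks among them. The `extract_answer` helper and the scores here are hypothetical stand-ins for answer-equivalence checking and the fine-tuned evaluator:

```python
from collections import defaultdict

def cluster_by_answer(solutions, extract_answer):
    """Group candidate solutions whose final answers agree."""
    clusters = defaultdict(list)
    for sol in solutions:
        clusters[extract_answer(sol)].append(sol)
    return clusters

def rerank_top_clusters(solutions, extract_answer, score, top_n=2):
    """Keep only the top_n largest answer clusters (majority-vote
    shortlist), then return the answer of the cluster whose best
    solution the evaluator scores highest."""
    clusters = cluster_by_answer(solutions, extract_answer)
    shortlist = sorted(
        clusters.items(), key=lambda kv: len(kv[1]), reverse=True
    )[:top_n]
    best_answer, _ = max(
        shortlist, key=lambda kv: max(score(s) for s in kv[1])
    )
    return best_answer

# Toy usage with made-up solutions and evaluator scores.
sols = ["... = 12", "... = 12", "... = 15", "... = 12", "... = 15", "... = 9"]
ans = lambda s: s.split("= ")[-1]
scores = {"12": 0.9, "15": 0.95, "9": 0.1}
print(rerank_top_clusters(sols, ans, lambda s: scores[ans(s)]))
```

Restricting the reranker to the most common clusters is what gives the method both its accuracy and its computational savings: the evaluator only scores a handful of candidates instead of all samples.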
The study’s findings on PaLM 2-S and PaLM 2-L underscore several key takeaways:
- SSFT’s dependence on well-formatted solutions: The quality and style of the step-by-step solutions significantly influence the fine-tuned model.
- Efficiency of reranking common solution clusters: Reranking only the most common solution clusters yields better performance and improved computational efficiency, presenting a potential standard practice for future work.
- Dual-task training benefits: Training the model for both solution generation and solution evaluation improves performance, and the proposed sequential multi-task fine-tuning proves more effective at enhancing the solution generation model than supervised solution fine-tuning alone.