In a new paper, a team from Microsoft introduces phi-1, a new transformer-based large language model for code. Specialized in Python coding, it is significantly smaller than competing models. In the study, the team also investigates how high-quality data can enhance the performance of state-of-the-art (SOTA) LLMs while reducing dataset size and training computation.
According to the team, the model is trained on “textbook quality” data, comprising synthetic text generated by GPT-3.5 and filtered web-sourced content. The 1.3B-parameter model is then fine-tuned on “textbook-exercise-like” data. Despite phi-1’s smaller size, it outperforms its larger competitors, demonstrating the potential of high-quality data for optimizing LLM performance.
The paper also dives into the enhancement of data quality, most notably data cleaning, a critical step in building modern datasets. This, in turn, could lead to more streamlined datasets and the ability to iterate on data more extensively. In terms of performance, the team attained 50.6% pass@1 accuracy on HumanEval and 55.5% pass@1 accuracy on MBPP (Mostly Basic Python Programs), among the best self-reported numbers using only a single LLM generation.
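For readers unfamiliar with the pass@1 metric quoted above: pass@k estimates the probability that at least one of k generated code samples passes all unit tests. A minimal sketch of the standard unbiased estimator (the function name and example numbers here are illustrative, not taken from the phi-1 paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn without replacement from n generations of which
    c are correct, passes the tests."""
    if n - c < k:
        # Fewer incorrect samples than k: some draw must include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# For k=1 the estimator reduces to the fraction of correct generations:
print(pass_at_k(10, 5, 1))  # 0.5
```

With k=1, as reported for phi-1, the metric is simply the fraction of problems solved on the first (single) generation.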
As mentioned above, what makes this significant is the reduction in computational resources and training data required for Python coding. The team at Microsoft showed that phi-1 can achieve impressive accuracy on code-related tasks while remaining orders of magnitude smaller than competing models.
This could, in theory, lead to more efficient and effective language models in the future, helping to reshape the near-term market by providing developers and their organizations with a new tool. Not only does it streamline coding tasks, it can also help tech-focused organizations and developers enhance their productivity while reducing environmental costs through lower resource use.