A group of researchers from The Chinese University of Hong Kong, Shanghai Artificial Intelligence Laboratory, and City University of Hong Kong have developed a new method to significantly improve the mathematical reasoning capabilities of open-source large language models (LLMs) like Llama-2 and CodeLlama.
In their paper “MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning”, the researchers introduce MathCoder, a framework that includes a novel math instruction-following dataset called MathCodeInstruct and a customized supervised fine-tuning approach.
MathCodeInstruct contains over 80,000 math problems paired with solutions that interleave natural language, code, and execution results in a format the researchers call LCE (Language, Code, Execution); a minimal sketch of one such record follows the list below. The key highlights of this dataset are:
- Solutions are collected from the powerful closed-source model GPT-4 Code Interpreter, ensuring high quality
- Additional problems are generated using an innovative prompting technique called “problem interpolation”, which creates problems of intermediate difficulty between basic and advanced math questions
- Multiple LCE solutions are distilled for each generated problem to further validate quality
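To make the LCE format concrete, here is a minimal, hypothetical sketch of what a single interleaved record might look like. The field names, example problem, and structure are illustrative assumptions, not the released dataset's actual schema.

```python
# Hypothetical sketch of one LCE-style (Language, Code, Execution) record.
# Field names and contents are illustrative assumptions, not the dataset's exact format.
lce_example = {
    "problem": "A shirt costs $25 and is discounted by 20%. What is the sale price?",
    "solution": [
        {"type": "text", "content": "Apply the 20% discount to the original price."},
        {"type": "code", "content": "price = 25\nsale_price = price * (1 - 0.20)\nprint(sale_price)"},
        {"type": "execution", "content": "20.0"},
        {"type": "text", "content": "The sale price is $20."},
    ],
}
```

Interleaving real execution output between reasoning and code is what lets a fine-tuned model learn to condition its next reasoning step on actual computed results rather than on numbers it hallucinates.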
The supervised fine-tuning approach trains models like Llama-2 and CodeLlama on this dataset; at inference time, generated code blocks are executed in real time and the results are fed back to the model, which assesses them and continues reasoning accordingly, similar to how the GPT-4 Code Interpreter operates.
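A rough sketch of how such execution-in-the-loop inference could work is shown below, assuming the model emits fenced ```python blocks. The `generate_until` helper, the stop marker, and the plain-text "Execution result:" delimiter are hypothetical stand-ins for the model's actual decoding interface and special tokens.

```python
import contextlib
import io


def run_python(code: str) -> str:
    """Execute a generated code block and capture its stdout.
    (A real system would sandbox execution rather than call exec directly.)"""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})  # assumption: blocks are self-contained Python
    return buffer.getvalue()


def solve(problem: str, generate_until, max_rounds: int = 5) -> str:
    """Alternate between model generation and code execution.

    `generate_until(prompt, stop)` is a hypothetical decoding helper that
    returns the model's text up to (and including) the given stop marker.
    """
    context = problem
    for _ in range(max_rounds):
        # Let the model reason in natural language and emit a fenced code block.
        step = generate_until(context, stop="```\n")  # hypothetical stop marker
        context += step
        if "```python" not in step:
            break  # no more code: the model has produced its final answer
        code = step.split("```python")[1].split("```")[0]
        # Feed the real execution result back so the model continues from it.
        context += f"\nExecution result:\n{run_python(code)}\n"
    return context
```

The key design point is that the model never has to guess the outcome of its own code: each round, the interpreter's actual output is appended to the context before the next reasoning step is generated.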
Evaluation on benchmarks like GSM8K and MATH shows MathCoder models substantially outperforming other open-source methods for math problem solving. Impressively, MathCoder achieves state-of-the-art scores among open-source models of 83.9% on GSM8K and 45.2% on MATH, even surpassing proprietary models like ChatGPT-3.5 and PaLM-2.
The key implications of this work are:
- Demonstrating the possibility of integrating reasoning, coding, and execution in open-source models to enhance complex reasoning capabilities
- Providing an effective framework and high-quality dataset to train performant math problem-solving models without needing massive resources
- Closing the gap between open-source and closed-source LLMs on mathematical challenges
Possible use cases include tutoring systems, quantitative analysis, financial modeling, and scientific computing. The availability of accurate open-source math models could make AI assistance more accessible for math-heavy fields.
The researchers plan to release the dataset and models to spur progress in this domain. While math is the focus here, the principles used in MathCoder could eventually generalize to other reasoning-based tasks involving computation.