A new study shows that GPT-4 Code Interpreter, OpenAI’s GPT-4 equipped with a built-in Python execution environment, possesses remarkable mathematical reasoning capabilities when prompted to generate code and verify its own solutions.
In a recent paper, researchers systematically analyzed GPT-4 Code Interpreter’s performance on math word problems. They found that its skill in solving these complex problems can be attributed to two key factors:
- Frequent, step-by-step code generation: Prompting the model to generate Python code in small segments interleaved with its reasoning substantially improved solution accuracy (see the sketch after this list).
- Self-debugging through code execution: The model evaluates code execution results and, when it encounters nonsensical outputs, adjusts its reasoning accordingly. This self-debugging mechanism also boosted performance.
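To make the first factor concrete, here is a hypothetical transcript of the kind of code the model might emit for a GSM8K-style word problem; the problem, variable names, and comments are illustrative only, not taken from the paper.

```python
# Problem (GSM8K-style): Janet's ducks lay 16 eggs per day. She eats 3 for
# breakfast and uses 4 to bake muffins, then sells the rest at $2 per egg.
# How much does she make daily?

# Step 1: start from the quantities stated in the problem.
eggs_per_day = 16
eggs_used = 3 + 4

# Step 2: work out how many eggs are left to sell.
eggs_sold = eggs_per_day - eggs_used
print(eggs_sold)  # 9 -- an intermediate result the model can sanity-check

# Step 3: compute the daily revenue.
price_per_egg = 2
daily_revenue = eggs_sold * price_per_egg
print(daily_revenue)  # 18 -- final answer, inspected before it is reported
```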
Building on these insights, the researchers introduced a technique called explicit code-based self-verification (CSV). This prompts GPT-4 Code Interpreter to verify its own solutions by generating additional code, akin to a student double-checking their exam work.
The key innovation of CSV is that it prompts GPT-4 Code Interpreter to leverage its own code generation skills to verify and refine solutions to math problems. This is done by appending an additional verification stage after the model provides its initial answer.
Specifically, the CSV prompt instructs GPT-4 Code Interpreter to:
1. Generate new code to check whether the solution is correct. This code essentially re-derives or confirms the answer.
2. Run the verification code and inspect the output.
3. If the code output indicates the solution is incorrect (i.e., it returns False), reconsider the logic and provide an amended, accurate solution.
4. Repeat steps 1-3 until the verification code returns True, signaling a verified solution (a minimal sketch of this loop follows the list).
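The following is a minimal sketch of that loop under stated assumptions: the helpers `query_model` (a call to GPT-4 Code Interpreter) and `run_code` (a sandboxed Python executor) are hypothetical stand-ins supplied by the caller, and the prompt strings paraphrase the idea rather than reproduce the paper’s exact prompts.

```python
from typing import Callable

def solve_with_csv(
    problem: str,
    query_model: Callable[[str], str],  # hypothetical: sends a prompt to the model
    run_code: Callable[[str], str],     # hypothetical: executes Python, returns stdout
    max_rounds: int = 3,
) -> str:
    # Obtain an initial step-by-step, code-assisted solution.
    solution = query_model(
        f"Solve step by step, writing small Python snippets as you go:\n{problem}"
    )
    for _ in range(max_rounds):
        # Step 1: ask the model for verification code that re-derives the answer.
        verify_code = query_model(
            "Write Python code that independently re-derives the answer and prints "
            "True if it matches the solution below, otherwise False.\n"
            f"Problem: {problem}\nSolution: {solution}"
        )
        # Step 2: run the verification code and inspect the output.
        if run_code(verify_code).strip() == "True":
            return solution  # verified, stop iterating
        # Step 3: verification failed, so ask the model to reconsider its logic.
        solution = query_model(
            "Your check printed False. Revise your reasoning and give a corrected solution.\n"
            f"Problem: {problem}\nPrevious solution: {solution}"
        )
    return solution  # best-effort answer if verification never succeeds
```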
Unlike prior techniques that rely on an external model or human to verify solutions, CSV enables autonomous self-verification within GPT-4 Code Interpreter using its code execution strengths.
This prompts the model to behave like a student double-checking their own work, catching mistakes and correcting erroneous reasoning, much as humans iteratively solve math problems. The additional verification stage and solution adjustments result in increased accuracy.
Remarkably, when self-verification deems an initial solution incorrect, GPT-4 Code Interpreter can autonomously revise its logic and provide an improved answer. CSV prompting delivered significant accuracy gains across challenging math datasets:
GSM8K Dataset:
- GPT-4 Code Interpreter alone achieved 92.9% accuracy on GSM8K.
- With CSV and weighted majority voting, accuracy increased to 97.0%.
This represents a new state-of-the-art result on GSM8K, outperforming previous best results such as GPT-3.5 (92.0%) and GPT-4 with model selection (96.8%).
MMLU-Math Dataset:
- GPT-4 Code Interpreter attained 87.5% accuracy on MMLU-Math with no fine-tuning.
- Adding CSV and weighted majority voting further improved accuracy to 89.2%.
Again, this establishes a new benchmark result, far surpassing earlier MMLU-Math scores reported for models such as Chinchilla (35.7%) and Galactica (41.3%).
Self-verification also provides a signal of how reliable each sampled solution is. The researchers turned these verification outcomes into confidence scores and incorporated them into a weighted majority voting scheme, further enhancing performance.
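As a rough illustration, the sketch below weights each sampled answer by the outcome of its verification; the three verification states and their weights are assumptions for the example, not the values tuned in the paper.

```python
from collections import defaultdict

def weighted_vote(samples):
    """samples: list of (answer, verification_state) pairs from repeated runs."""
    # Assumed weights: verified answers count most, failed verifications least.
    weights = {"True": 1.0, "Uncertain": 0.5, "False": 0.1}
    scores = defaultdict(float)
    for answer, state in samples:
        scores[answer] += weights.get(state, 0.0)
    return max(scores, key=scores.get)

# Example: three sampled solutions; the two verified ones agree on 18.
print(weighted_vote([("18", "True"), ("18", "True"), ("20", "False")]))  # -> 18
```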
This study demonstrates that code generation and execution mechanisms allow large language models to excel at complex logical reasoning and mathematical tasks. The proposed techniques provide promising directions to amplify these strengths.
According to the authors, capabilities like symbolic reasoning and arithmetic computation have been long-standing challenges for AI systems. This work represents a pivotal advancement, bringing us closer to human-level mathematical problem solving.
The self-verification abilities indicate that models can introspect and debug their own solutions, reducing blind spots. Such autonomous skills could accelerate research and may generalize to other domains. However, further analysis is required to determine broader implications.