A new paper titled “RLTF: Reinforcement Learning from Unit Test Feedback” proposes a method for improving program synthesis using reinforcement learning and real-time feedback from unit tests. The work comes from researchers at Tencent and was posted to arXiv on July 10, 2023.
Program synthesis involves automatically generating code based on high-level descriptions of desired functionality. Recently, large language models (LLMs) like Codex and AlphaCode have shown impressive results on programming tasks. However, they still face challenges in ensuring full syntactic and functional correctness.
The paper introduces RLTF, which uses an online reinforcement learning framework to progressively improve an LLM’s code generation capabilities. The key ideas are:
- Online training loop: Two models share weights. One generates programs using the latest parameters and sends them to a buffer; the other consumes this data for training. Because the buffer is refreshed continuously, the model trains on feedback for its own recent outputs in real time.
- Multi-granularity rewards: The framework derives coarse-grained, fine-grained, and adaptive rewards from unit test results to guide the model. Coarse-grained rewards score the program as a whole, fine-grained rewards target the specific code segments responsible for an error, and adaptive rewards scale with the ratio of unit tests passed.
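As a rough illustration of how unit test results could be turned into the three kinds of reward signal listed above, here is a minimal Python sketch. It is not the paper's implementation: the `evaluate` function, the `Feedback` record, the `solve` entry-point name, and every numeric reward value are assumptions chosen for illustration. The candidate is also run in-process with `exec`, which a real system would replace with a sandbox.

```python
import traceback
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple


@dataclass
class Feedback:
    coarse: float              # one scalar for the whole program
    adaptive: Optional[float]  # grows with the pass ratio; None if the program crashed
    error_line: Optional[int]  # 1-based line a fine-grained penalty could target


def evaluate(source: str, tests: List[Tuple[tuple, Any]], entry: str = "solve") -> Feedback:
    """Score a candidate program against (args, expected) unit tests.

    All reward magnitudes below are illustrative placeholders.
    """
    if not tests:
        raise ValueError("at least one unit test is required")

    # Syntax errors: strongest coarse penalty, plus the offending line.
    try:
        code = compile(source, "<candidate>", "exec")
    except SyntaxError as exc:
        return Feedback(coarse=-1.0, adaptive=None, error_line=exc.lineno)

    namespace: dict = {}
    passed = 0
    try:
        exec(code, namespace)  # assumption: trusted input; sandbox this in practice
        func = namespace[entry]
        for args, expected in tests:
            if func(*args) == expected:
                passed += 1
    except Exception as exc:
        # Runtime errors: find the last traceback frame inside the candidate code.
        frames = [f for f in traceback.extract_tb(exc.__traceback__)
                  if f.filename == "<candidate>"]
        line = frames[-1].lineno if frames else None
        return Feedback(coarse=-0.6, adaptive=None, error_line=line)

    ratio = passed / len(tests)
    coarse = 1.0 if passed == len(tests) else -0.3
    # Adaptive reward: linear in the fraction of tests passed.
    adaptive = -0.3 + 1.3 * ratio
    return Feedback(coarse=coarse, adaptive=adaptive, error_line=None)
```

Under these placeholder values, a candidate that passes one of two tests gets a coarse reward of -0.3 and an adaptive reward of about 0.35, while a candidate that raises an exception on its third line is reported with `error_line=3`, which is where a fine-grained penalty could be applied.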
RLTF was evaluated on the APPS and MBPP program synthesis benchmarks using CodeT5 and CodeGen models. The paper reports new state-of-the-art results, outperforming prior RL methods such as CodeRL and PPOCoder. Ablation studies confirmed that both the online framework and the multi-granularity rewards contribute to the gains.
The work demonstrates how real-time feedback and targeted rewards can enhance program synthesis models. The online approach facilitates greater environment exploration and training stability. The fine-grained rewards provide nuanced guidance by isolating code deficiencies.
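To make the online idea concrete, the following Python sketch shows a bounded buffer that a generation step fills and a training step reads from, so training batches always come from recent policy versions. It is a single-threaded toy under stated assumptions: `OnlineBuffer`, `Sample`, `run_online_loop`, the capacity, and the random stand-in rewards are all hypothetical, and the "update" is just a version counter rather than a real gradient step.

```python
import random
from collections import deque
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sample:
    program: str
    reward: float
    policy_version: int  # which parameter version generated this sample


class OnlineBuffer:
    """Fixed-size FIFO buffer: new samples push out the oldest ones."""

    def __init__(self, capacity: int) -> None:
        self._items: deque = deque(maxlen=capacity)

    def push(self, sample: Sample) -> None:
        self._items.append(sample)

    def sample_batch(self, batch_size: int, rng: random.Random) -> List[Sample]:
        k = min(batch_size, len(self._items))
        return rng.sample(list(self._items), k)

    def items(self) -> List[Sample]:
        return list(self._items)

    def __len__(self) -> int:
        return len(self._items)


def run_online_loop(steps: int, capacity: int = 4, batch_size: int = 2,
                    seed: int = 0) -> Tuple[OnlineBuffer, List[Tuple[int, List[int]]]]:
    rng = random.Random(seed)
    buffer = OnlineBuffer(capacity)
    version = 0  # stand-in for the shared model weights
    history = []
    for step in range(steps):
        # Generation side: produce a program with the latest parameters.
        buffer.push(Sample(program=f"candidate_{step}",
                           reward=rng.uniform(-1.0, 1.0),  # placeholder reward
                           policy_version=version))
        # Training side: consume a batch from the buffer, then "update".
        batch = buffer.sample_batch(batch_size, rng)
        version += 1
        # Record the version after the update and the versions it trained on.
        history.append((version, [s.policy_version for s in batch]))
    return buffer, history
```

With a capacity of 4, after ten steps the buffer holds only samples from versions 6 through 9, so stale outputs from early policies are no longer used for training. A real system would run generation and training concurrently rather than in one loop.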
The improvements shown by RLTF could increase the reliability and capabilities of AI programming assistants. Next steps include incorporating more diverse training data and feedback from static analysis tools. More reliably correct generated code would bring LLMs a step closer to automating larger parts of software development.