The goal of program synthesis, or code generation, is to generate executable code from a given description. A growing number of recent studies employ reinforcement learning (RL) to improve the performance of large language models (LLMs) for code. However, these RL methods have only used offline frameworks, which limits their exploration of new sample spaces. In addition, current approaches that use unit test signals are rather simple and do not account for the specific locations of errors within the code. To address these issues, we propose RLTF (Reinforcement Learning from Unit Test Feedback), a novel online RL framework that refines code LLMs with multi-granularity unit test feedback. Our approach generates data in real time during training and simultaneously uses fine-grained feedback signals to guide the model toward producing higher-quality code. Extensive experiments show that RLTF achieves state-of-the-art performance on the APPS and MBPP benchmarks. Our code can be found at: https://github.com/Zyq-scut/RLTF.
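To make the idea of unit test feedback at more than one granularity concrete, the following is a minimal sketch and not the RLTF implementation. It runs a candidate program against unit tests, assigns a coarse reward to the whole sample, and also records which source line raised the error so that a penalty can be concentrated there. The reward values, the function names `run_unit_tests` and `feedback`, the entry-point name `solve`, and the test format are all assumptions made for illustration.

```python
import traceback

# Illustrative reward values: assumptions for this sketch, not taken from the paper.
REWARDS = {"pass": 1.0, "fail": -0.3, "runtime_error": -0.6, "syntax_error": -1.0}


def run_unit_tests(code, tests, func_name="solve"):
    """Run `code` against `tests`, a list of (args, expected) pairs.

    Returns (outcome, error_line, pass_ratio). `error_line` is the 1-based
    line of `code` where a syntax or runtime error occurred, else None.
    NOTE: exec() runs untrusted code in-process; a real setup would sandbox it.
    """
    total = max(len(tests), 1)  # avoid division by zero for an empty test list
    try:
        compiled = compile(code, "<candidate>", "exec")
    except SyntaxError as exc:
        return "syntax_error", exc.lineno, 0.0

    namespace = {}
    passed = 0
    try:
        exec(compiled, namespace)
        func = namespace[func_name]
        for args, expected in tests:
            if func(*args) == expected:
                passed += 1
    except Exception as exc:
        # Keep the deepest traceback frame that belongs to the candidate code.
        error_line = None
        for frame in traceback.extract_tb(exc.__traceback__):
            if frame.filename == "<candidate>":
                error_line = frame.lineno
        return "runtime_error", error_line, passed / total

    outcome = "pass" if passed == len(tests) else "fail"
    return outcome, None, passed / total


def feedback(code, tests, func_name="solve"):
    """Combine a coarse, sample-level reward with a per-line penalty."""
    outcome, error_line, ratio = run_unit_tests(code, tests, func_name)
    n_lines = len(code.splitlines())
    line_penalty = [0.0] * n_lines
    if error_line is not None and 1 <= error_line <= n_lines:
        # Fine-grained signal: put the penalty on the line that failed.
        line_penalty[error_line - 1] = REWARDS[outcome]
    return {
        "outcome": outcome,
        "coarse": REWARDS[outcome],
        "pass_ratio": ratio,
        "line_penalty": line_penalty,
    }


tests = [((1, 2), 3), ((0, 0), 0)]
crash = "def solve(a, b):\n    c = a + b\n    return c / 0\n"
result = feedback(crash, tests)
print(result["outcome"], result["line_penalty"])
```

In an online setting, a function like `feedback` would be called on programs sampled from the current model during training, so the rewards reflect what the model generates now rather than a fixed offline dataset. How the per-line signal is mapped onto token-level losses is left out here and would depend on the tokenizer and the RL objective.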