The advancement of large language models (LLMs) has significantly propelled the field of code generation. Previous work integrated reinforcement learning (RL) with compiler feedback to explore the output space of LLMs and enhance code generation quality. However, the lengthy code that LLMs generate in response to complex human requirements makes RL exploration challenging. Moreover, because unit tests may not cover all of the complicated code, optimizing LLMs on code snippets that are never executed is ineffective. To tackle these challenges, we introduce StepCoder, a novel RL framework for code generation consisting of two main components: CCCS addresses the exploration challenge by breaking the long-sequence code generation task into a Curriculum of Code Completion Subtasks, while FGO provides Fine-Grained Optimization by masking the unexecuted code segments so that the model is optimized only on executed code. In addition, we construct the APPS+ dataset for RL training, which is manually verified to ensure the correctness of unit tests. Experimental results show that our method improves the ability to explore the output space and outperforms state-of-the-art approaches on the corresponding benchmarks.
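The two components named in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the curriculum is scheduled or how the mask enters the objective, so everything below is an assumption made for illustration: the function names `completion_start_points` and `masked_mean_loss` are hypothetical, the curriculum is assumed to hand the model a shrinking prefix of a reference solution, and the masked objective is assumed to be a plain average over tokens that the unit tests actually executed.

```python
from typing import List, Sequence


def completion_start_points(num_lines: int, num_stages: int) -> List[int]:
    """Hypothetical CCCS-style schedule (an assumption, not the paper's exact rule).

    Stage k gives the model the first `start` lines of a reference solution
    and asks it to complete the rest. The prefix shrinks stage by stage, so
    the last stage (prefix length 0) is full code generation.
    """
    if num_lines < 0 or num_stages < 1:
        raise ValueError("num_lines must be >= 0 and num_stages >= 1")
    return [num_lines * (num_stages - 1 - k) // num_stages for k in range(num_stages)]


def masked_mean_loss(token_losses: Sequence[float], executed: Sequence[bool]) -> float:
    """Hypothetical FGO-style objective (an assumption, not the paper's exact loss).

    Averages per-token losses over executed tokens only; tokens belonging to
    code that no unit test reached are masked out and contribute nothing.
    """
    if len(token_losses) != len(executed):
        raise ValueError("token_losses and executed must have the same length")
    kept = [loss for loss, was_run in zip(token_losses, executed) if was_run]
    if not kept:
        return 0.0  # nothing was executed, so there is nothing to optimize
    return sum(kept) / len(kept)


if __name__ == "__main__":
    # An 8-line reference solution split into 4 curriculum stages.
    print(completion_start_points(8, 4))
    # Four tokens, of which the second and fourth were never executed.
    print(masked_mean_loss([1.0, 2.0, 3.0, 5.0], [True, False, True, False]))
```

Under these assumptions, early stages ask for short completions that are easy to explore, and the masked average keeps unexecuted branches from receiving a reward signal they did not earn.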