The advancement of large language models (LLMs) has significantly propelled the field of code generation. Previous work integrated reinforcement learning (RL) with compiler feedback to explore the output space of LLMs and thereby improve code generation quality. However, the lengthy code that LLMs generate in response to complex human requirements makes RL exploration challenging. Moreover, because unit tests may not cover all of the complicated code, optimizing LLMs on code snippets that were never executed is ineffective. To tackle these challenges, we introduce StepCoder, a novel RL framework for code generation consisting of two main components: CCCS addresses the exploration challenge by breaking the long-sequence code generation task into a Curriculum of Code Completion Subtasks, while FGO provides Fine-Grained Optimization by masking the unexecuted code segments so that the model is optimized only on code that was actually executed. In addition, we construct the APPS+ dataset for RL training, which is manually verified to ensure the correctness of its unit tests. Experimental results show that our method improves the ability to explore the output space and outperforms state-of-the-art approaches on the corresponding benchmarks. Our dataset APPS+ and StepCoder are available online.
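The two components can be illustrated with a minimal Python sketch. It is not the StepCoder implementation, and it rests on assumptions the abstract does not state: that each completion subtask reveals a prefix of a reference solution and asks the model to generate the rest, that the revealed prefix shrinks linearly across stages, and that a per-token "executed" mask is already available from some coverage tool. The names `curriculum_subtasks` and `masked_mean_loss` are hypothetical.

```python
from typing import List, Tuple


def curriculum_subtasks(
    solution_lines: List[str], num_stages: int
) -> List[Tuple[List[str], List[str]]]:
    """CCCS-style sketch (assumed scheme): return (revealed_prefix, remaining) pairs.

    Early stages reveal more of the reference solution, so the model only has
    to complete a short tail. The last stage reveals nothing, which is the
    original full generation task.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    n = len(solution_lines)
    stages = []
    for stage in range(num_stages):
        # Assumption: the revealed prefix shrinks linearly and reaches zero.
        keep = n * (num_stages - 1 - stage) // num_stages
        stages.append((solution_lines[:keep], solution_lines[keep:]))
    return stages


def masked_mean_loss(token_losses: List[float], executed: List[bool]) -> float:
    """FGO-style sketch: average the loss over executed tokens only.

    Tokens in code that the unit tests never ran are masked out, so they
    contribute nothing to the optimization signal.
    """
    if len(token_losses) != len(executed):
        raise ValueError("token_losses and executed must have the same length")
    kept = [loss for loss, ran in zip(token_losses, executed) if ran]
    # Assumption: if nothing was executed, the sample contributes zero loss.
    return sum(kept) / len(kept) if kept else 0.0
```

For example, `curriculum_subtasks(["a", "b", "c", "d"], 3)` yields prefixes of length 2, 1, and 0, and `masked_mean_loss([1.0, 2.0, 3.0], [True, False, True])` averages only the first and third token losses.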