Code generation plays a crucial role in various tasks, such as code auto-completion and mathematical reasoning. Previous work has proposed numerous methods to enhance code generation performance, including integrating feedback from the compiler. Inspired by this, we present ReflectionCoder, a novel approach that effectively leverages reflection sequences constructed by integrating compiler feedback to improve one-off code generation performance. Furthermore, we propose reflection self-distillation and dynamically masked distillation to effectively utilize these reflection sequences. Extensive experiments on three benchmarks, i.e., HumanEval (+), MBPP (+), and MultiPL-E, demonstrate that models fine-tuned with our method achieve state-of-the-art performance. Notably, ReflectionCoder-DeepSeek-Coder-33B reaches a pass@1 of 82.9 (76.8) on HumanEval (+) and 84.1 (72.0) on MBPP (+), on par with GPT-3.5-Turbo and Claude-3-Opus, and surpasses early GPT-4. Beyond the code domain, we believe this approach can benefit other domains that focus on final results and require long reasoning paths. Code and data are available at https://github.com/SenseLLM/ReflectionCoder.