Training LLMs to Better Self-Debug and Explain Code

In code generation, self-debugging is crucial: it allows LLMs to refine their generated code based on execution feedback. This matters because producing a correct solution in a single attempt is challenging for complex tasks. Prior work on self-debugging mostly focuses on prompting methods that provide LLMs with few-shot examples, which work poorly on small open-source LLMs. In this work, we propose a training framework that significantly improves the self-debugging capability of LLMs. We observe that a chain of explanations of the wrong code, followed by code refinement, helps LLMs better analyze the wrong code and refine it. We therefore propose an automated pipeline that collects a high-quality dataset for code explanation and refinement by generating a number of explanation and refinement trajectories and filtering them through execution verification. We perform supervised fine-tuning (SFT) followed by reinforcement learning (RL) on both success and failure trajectories, with a novel reward design that considers the quality of both the code explanation and the refinement. SFT improves pass@1 by up to 15.92% and pass@10 by 9.30% over four benchmarks. RL training brings an additional improvement of up to 3.54% on pass@1 and 2.55% on pass@10. The trained LLMs show iterative refinement ability and can keep refining code continuously. Lastly, our human evaluation shows that LLMs trained with our framework generate more useful code explanations and help developers better understand bugs in source code.
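
The filtering step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the names `Trajectory`, `passes_tests`, and `partition_trajectories` are hypothetical, the tests are plain Python callables, and candidates are run with an unsandboxed `exec`, which a real system would replace with an isolated executor. The sketch only shows the idea of keeping a sampled explanation and refinement when the refined code passes execution verification, while retaining the failures separately.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Trajectory:
    """Hypothetical record: an explanation of the bug plus the refined code."""
    explanation: str
    refined_code: str


def passes_tests(code: str, tests: List[Callable[[dict], bool]]) -> bool:
    """Run the candidate code, then check every test against its namespace."""
    namespace: dict = {}
    try:
        # Assumption: trusted toy input. Real pipelines should sandbox this.
        exec(code, namespace)
        return all(test(namespace) for test in tests)
    except Exception:
        # Syntax errors, runtime errors, and missing names all count as failures.
        return False


def partition_trajectories(
    candidates: List[Trajectory], tests: List[Callable[[dict], bool]]
) -> Tuple[List[Trajectory], List[Trajectory]]:
    """Split sampled trajectories into (verified, failed) by execution."""
    verified, failed = [], []
    for traj in candidates:
        (verified if passes_tests(traj.refined_code, tests) else failed).append(traj)
    return verified, failed


# Toy task: total(xs) should return the sum of a list.
candidates = [
    Trajectory("The slice dropped the last element.",
               "def total(xs):\n    return sum(xs)\n"),
    Trajectory("The code looks correct as written.",
               "def total(xs):\n    return sum(xs[:-1])\n"),
    Trajectory("Missing colon.", "def total(xs) return 0"),
]
tests = [
    lambda ns: ns["total"]([1, 2, 3]) == 6,
    lambda ns: ns["total"]([]) == 0,
]
verified, failed = partition_trajectories(candidates, tests)
```

Here only the first candidate is verified. The other two land in the failed list, which loosely mirrors the abstract's point that both success and failure trajectories are used during training.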
