In Automated Program Repair (APR), synthesizing correct patches for real-world systems written in general-purpose programming languages is challenging. Recent Large Language Models (LLMs) have proven to be helpful "copilots" that assist developers with various coding tasks, and they have also been applied directly to patch synthesis. However, most LLMs treat programs as sequences of tokens, meaning that they are ignorant of the underlying semantic constraints of the target programming language. This results in many statically invalid generated patches, which impedes the practicality of the technique. We therefore propose Repilot, a general code generation framework that further copilots the AI "copilots" (i.e., LLMs) by synthesizing more valid patches during the repair process. Our key insight is that many LLMs produce outputs autoregressively (i.e., token by token), much as humans write programs, and that this process can be significantly boosted and guided by a Completion Engine. Repilot synergistically synthesizes a candidate patch through the interaction between an LLM and a Completion Engine, which 1) prunes away infeasible tokens suggested by the LLM and 2) proactively completes the token based on the suggestions provided by the Completion Engine. Our evaluation on a subset of the widely used Defects4J 1.2 and 2.0 datasets shows that Repilot outperforms state-of-the-art techniques by fixing 27% and 47% more bugs, respectively. Moreover, Repilot produces more valid and correct patches than the base LLM under the same budget. While this work focuses on applying Repilot to APR, the overall approach also generalizes to other code generation tasks.
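The two roles of the Completion Engine described in the abstract (pruning infeasible tokens and proactively completing a token) can be illustrated with a minimal sketch. This is not Repilot's implementation: the engine here is a toy that only knows a fixed list of in-scope identifiers, and every name (`KNOWN_IDENTIFIERS`, `is_feasible`, `unique_completion`, `guided_decode`, `toy_rank_tokens`) is a hypothetical stand-in chosen for illustration. The "LLM" is likewise assumed to be a function that returns candidate tokens ranked from most to least preferred.

```python
from typing import Callable, List, Optional

EOS = "<eos>"  # assumed end-of-sequence marker for this toy example

# Hypothetical stand-in for a Completion Engine: it only knows which
# identifiers are in scope. A real engine reasons over types and scopes.
KNOWN_IDENTIFIERS = ["buffer", "bufferSize", "builder"]


def is_feasible(prefix: str, token: str) -> bool:
    """True if appending `token` can still lead to a known identifier."""
    if token == EOS:
        return prefix in KNOWN_IDENTIFIERS
    candidate = prefix + token
    return any(name.startswith(candidate) for name in KNOWN_IDENTIFIERS)


def unique_completion(prefix: str) -> Optional[str]:
    """Return the missing suffix if exactly one identifier extends `prefix`."""
    matches = [name for name in KNOWN_IDENTIFIERS if name.startswith(prefix)]
    if len(matches) == 1 and matches[0] != prefix:
        return matches[0][len(prefix):]
    return None


def guided_decode(rank_tokens: Callable[[str], List[str]], max_steps: int = 10) -> str:
    """Decode token by token, consulting the toy engine at every step."""
    prefix = ""
    for _ in range(max_steps):
        # 1) Prune: drop tokens the engine considers infeasible.
        feasible = [t for t in rank_tokens(prefix) if is_feasible(prefix, t)]
        if not feasible:
            break
        token = feasible[0]  # highest-ranked token that survived pruning
        if token == EOS:
            break
        prefix += token
        # 2) Complete: if only one continuation exists, take it directly.
        rest = unique_completion(prefix)
        if rest is not None:
            prefix += rest
    return prefix


def toy_rank_tokens(prefix: str) -> List[str]:
    """Toy 'LLM': a fixed ranking whose top choice is not a valid identifier."""
    return ["size", "bui", "buf", EOS, "fer", "Size", "lder"]


print(guided_decode(toy_rank_tokens))
```

In this sketch the model's top-ranked token `size` is pruned because no known identifier starts with it, the next candidate `bui` is accepted, and the engine then fills in the rest of `builder` without further model queries, since it is the only identifier with that prefix.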