Although many Automated Program Repair (APR) approaches have been proposed and have achieved remarkable performance, they remain limited in fixing bugs that require analyzing and reasoning about the logic of the buggy program. Recently, large language models (LLMs) guided by prompt engineering have attracted much attention for their ability to handle many kinds of tasks, including bug fixing. However, prompt quality strongly affects LLM performance, and manually constructing high-quality prompts is costly. To address this limitation, we propose ThinkRepair, a self-directed LLM-based APR approach with two main phases: a collection phase and a fixing phase. The collection phase automatically gathers diverse chains of thought, which constitute pre-fixed knowledge, by instructing LLMs with a Chain-of-Thought (CoT) prompt. The fixing phase repairs a target bug by first selecting examples for few-shot learning and then automatically interacting with LLMs, optionally appending feedback from testing information. Evaluations on two widely studied datasets (Defects4J and QuixBugs), comparing ThinkRepair with 12 state-of-the-art (SOTA) APR approaches, indicate the superiority of ThinkRepair in fixing bugs. Notably, ThinkRepair fixes 98 bugs on Defects4J V1.2, improving over the baselines by 27% to 344.4%. On Defects4J V2.0, ThinkRepair fixes 12 to 65 more bugs than the SOTA APR approaches. ThinkRepair also achieves a considerable improvement on QuixBugs (31 for Java and 21 for Python at most).
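The two phases described in the abstract can be illustrated with a minimal Python sketch. It is not ThinkRepair's implementation: every name here (`collect`, `select_examples`, `fix`, `COT_PROMPT`) is hypothetical, the LLM and the test runner are plain callables supplied by the caller, and the LLM response is treated directly as the candidate patch. Two further assumptions go beyond what the abstract states: the collection phase keeps only responses that pass the tests, and few-shot examples are chosen by simple token overlap, which stands in for whatever selection strategy the approach actually uses.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces (assumptions, not ThinkRepair's API):
# an LLM maps a prompt to a response; a test runner maps a candidate
# patch to (passed, failure_message).
LLM = Callable[[str], str]
TestRunner = Callable[[str], Tuple[bool, str]]
Example = Tuple[str, str]  # (buggy code, LLM response)

# Hypothetical CoT instruction; the real prompt wording is not given in the abstract.
COT_PROMPT = "Think step by step, then give the fixed function.\n"


def collect(llm: LLM, bugs: List[str], run_tests: TestRunner) -> List[Example]:
    """Collection phase: query the LLM with a CoT prompt for each bug.

    Assumption: only responses whose patch passes the tests are kept.
    """
    pool: List[Example] = []
    for buggy in bugs:
        response = llm(COT_PROMPT + buggy)
        passed, _ = run_tests(response)
        if passed:
            pool.append((buggy, response))
    return pool


def select_examples(pool: List[Example], buggy: str, k: int) -> List[Example]:
    """Pick the k pool entries most similar to the target bug.

    Similarity is Jaccard overlap of whitespace tokens, a stand-in metric.
    """
    target = set(buggy.split())

    def similarity(example: Example) -> float:
        tokens = set(example[0].split())
        union = tokens | target
        return len(tokens & target) / len(union) if union else 0.0

    return sorted(pool, key=similarity, reverse=True)[:k]


def fix(llm: LLM, pool: List[Example], buggy: str, run_tests: TestRunner,
        k: int = 2, max_rounds: int = 3) -> Optional[str]:
    """Fixing phase: few-shot prompt, then retry with test feedback appended."""
    shots = select_examples(pool, buggy, k)
    prompt = "".join(f"Buggy:\n{b}\nAnswer:\n{r}\n\n" for b, r in shots)
    prompt += COT_PROMPT + buggy
    for _ in range(max_rounds):
        candidate = llm(prompt)
        passed, message = run_tests(candidate)
        if passed:
            return candidate
        # Append the failing-test information so the next round can use it.
        prompt += f"\nThe previous fix failed: {message}\nTry again."
    return None
```

Passing the LLM and the test runner as callables keeps the sketch independent of any particular model API or test framework; `fix` returns `None` when no candidate passes within `max_rounds` interactions.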