In recent years, large language models (LLMs) have demonstrated substantial potential for automatic program repair (APR) tasks. However, current evaluations of these models for APR focus solely on the limited context of the single function or file where the bug is located, overlooking the valuable information in the repository-level context. This paper investigates the performance of popular LLMs on repository-level repair tasks. We introduce RepoBugs, a new benchmark comprising 124 typical repository-level bugs from open-source repositories. Preliminary experiments with GPT-3.5, using only the function where the error is located, reveal a repair rate of just 22.58% on RepoBugs, diverging significantly from GPT-3.5's performance on function-level bugs in related studies. This underscores the importance of providing repository-level context when addressing bugs at this level. However, the repository-level context offered by the preliminary method is often redundant and imprecise, and easily exceeds the prompt length limit of LLMs. To solve this problem, we propose a simple and universal repository-level context extraction method (RLCE) designed to provide more precise context for repository-level code repair tasks. Evaluations on three mainstream LLMs show that RLCE significantly enhances their ability to repair repository-level bugs, with improvements of up to 160% over the preliminary method. Additionally, we conduct a comprehensive analysis of the effectiveness and limitations of RLCE, along with the capacity of LLMs to address repository-level bugs, offering valuable insights for future research.