The Fact Selection Problem in LLM-Based Program Repair

Recent research has shown that incorporating bug-related facts, such as stack traces and GitHub issues, into prompts enhances the bug-fixing capabilities of large language models (LLMs). Given the ever-increasing context windows of these models, a critical question arises: which facts, and how many of them, should be included in a prompt to maximise the chance of correctly fixing a bug? To answer this question, we conducted a large-scale study using over 19K prompts, featuring various combinations of seven diverse facts, to repair 314 bugs from open-source Python projects in the BugsInPy benchmark. Our findings show that each fact is beneficial, from simple syntactic details such as code context to semantic information previously unexplored in the context of LLMs, such as angelic values. Specifically, each fact helps fix some bugs that would otherwise remain unresolved or be fixed only with a low success rate. Importantly, we discovered that the effectiveness of program repair prompts is non-monotonic in the number of facts used: including too many facts leads to subpar outcomes. These insights led us to define the fact selection problem: determining the optimal set of facts to include in a prompt so as to maximise an LLM's performance on a given task instance. We found that there is no one-size-fits-all set of facts for bug repair. We therefore developed a basic statistical model, named Maniple, which selects the facts to include in the prompt for a specific bug. This model significantly outperforms the best generic fact set. To underscore the significance of the fact selection problem, we benchmarked Maniple against state-of-the-art zero-shot, non-conversational LLM-based bug repair methods. On our testing dataset of 157 bugs, Maniple repairs 88 bugs, 17% above the best configuration.
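
To make the fact selection problem concrete, the following minimal Python sketch contrasts a single generic fact set with a per-bug choice. It is an illustration under stated assumptions, not Maniple itself: it reads the best set off already-known success rates, whereas Maniple is a statistical model that selects facts for a given bug. The fact identifiers, the function names, and all success rates in the toy data are hypothetical; only four of the seven facts are named in the abstract, so only those four appear here.

```python
from itertools import combinations

# Four of the seven facts are named in the abstract; these identifiers are
# illustrative labels, not names taken from the study's artefacts.
FACTS = ["code_context", "stack_trace", "github_issue", "angelic_values"]


def all_fact_sets(facts):
    """Return every subset of `facts` as a frozenset (2**len(facts) subsets)."""
    return [
        frozenset(combo)
        for size in range(len(facts) + 1)
        for combo in combinations(facts, size)
    ]


def best_generic_set(success, fact_sets):
    """Pick the one fact set with the highest mean success rate over all bugs."""
    bugs = list(success)
    return max(
        fact_sets,
        key=lambda s: sum(success[b].get(s, 0.0) for b in bugs) / len(bugs),
    )


def best_set_per_bug(success, fact_sets):
    """Pick, for each bug separately, the fact set with the highest success rate."""
    return {
        bug: max(fact_sets, key=lambda s: rates.get(s, 0.0))
        for bug, rates in success.items()
    }


# Hypothetical toy data: success[bug][fact_set] is a made-up repair success rate.
TOY_FACTS = ["stack_trace", "github_issue"]
ST, GI = frozenset({"stack_trace"}), frozenset({"github_issue"})
BOTH = ST | GI
toy_success = {
    "bug_a": {frozenset(): 0.1, ST: 0.8, BOTH: 0.5},  # more facts hurt here
    "bug_b": {frozenset(): 0.2, ST: 0.3, GI: 0.9, BOTH: 0.7},
}

toy_sets = all_fact_sets(TOY_FACTS)
generic = best_generic_set(toy_success, toy_sets)
per_bug = best_set_per_bug(toy_success, toy_sets)
```

In this toy data the best generic choice is to include both facts, yet each bug on its own is served better by a single, different fact. That mirrors the two observations in the abstract: adding facts does not monotonically help, and no single fact set is best for every bug.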