In this paper, we first show that increasing beam size, even for small LLMs (1B-7B parameters), demands extensive GPU memory in LLM-based APR, with up to 80% of recurring crashes caused by memory overloads. Two seemingly simple remedies for reducing memory consumption are (1) quantizing the LLM, i.e., converting its weights from high-precision to lower-precision values, and (2) making beam search sequential, i.e., forwarding each beam through the model one at a time and then concatenating the outputs back into a single result. However, through both theoretical analysis and experiments, we show that these approaches remain ineffective. To address this, we introduce FLAMES, a novel LLM-based APR technique that employs semantic-guided patch generation to improve both repair effectiveness and memory efficiency. Unlike conventional methods that rely on beam search, FLAMES uses greedy decoding for memory efficiency while steering the search toward more promising repair candidates via a semantic-guided best-first search algorithm. At each decoding step, FLAMES uses semantic feedback from test validation, such as the number of passing and failing test cases, to select the most promising token to explore further. Our empirical evaluation on Defects4J shows that FLAMES reduces memory consumption by up to 83% compared to conventional LLM-based APR without compromising time efficiency. Moreover, FLAMES correctly fixes 133 bugs on Defects4J, 10 more than the best baseline. These improvements also generalize to the HumanEval-Java and TransformedD4J datasets, where FLAMES generates 12% and 36.5% more correct patches, respectively, than the best baseline.
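The core idea, a best-first search over partial token sequences ranked by test-validation feedback rather than pure model likelihood, can be sketched as follows. This is a minimal illustration only, not the paper's implementation: `next_token_probs` stands in for a real LLM's next-token distribution, and `semantic_score` stands in for running the test suite on a candidate patch; both are hypothetical toy functions.

```python
import heapq

# Toy stand-ins (assumptions for illustration, not FLAMES itself):
# - next_token_probs: fake LM returning the k most likely next tokens
# - semantic_score: fake test-validation feedback (passing-test count)
VOCAB = {"a": 0.6, "b": 0.3, "c": 0.1}

def next_token_probs(prefix, k=2):
    """Return the k most likely next tokens for a prefix (toy model)."""
    return sorted(VOCAB.items(), key=lambda kv: -kv[1])[:k]

def semantic_score(tokens):
    """Toy semantic feedback: reward tokens matching a 'correct patch'."""
    target = ["a", "a", "b"]
    return sum(1 for t, g in zip(tokens, target) if t == g)

def best_first_repair(max_len=3, k=2):
    """Expand the most semantically promising partial patch first.

    At each step only k candidate continuations are scored, so memory
    stays near greedy decoding instead of growing with beam width.
    """
    # Min-heap keyed by negated semantic score => best candidate first.
    heap = [(0, [])]
    while heap:
        _, tokens = heapq.heappop(heap)
        if len(tokens) == max_len:
            return tokens  # first complete candidate in best-first order
        for tok, _p in next_token_probs(tokens, k):
            cand = tokens + [tok]
            heapq.heappush(heap, (-semantic_score(cand), cand))
    return None

print(best_first_repair())  # → ['a', 'a', 'b']
```

In this toy run the search abandons the greedy-looking prefix `["a", "b"]` as soon as the feedback signal favors `["a", "a"]`, which is the behavior the abstract attributes to semantic guidance.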