Recent studies have explored the performance of Large Language Models (LLMs) on various Software Engineering (SE) tasks, such as code generation and bug fixing. However, these approaches typically rely on context data from the current snapshot of the project, overlooking the rich historical data residing in real-world software repositories. Additionally, the impact of prompt styles on LLM performance for SE tasks within a historical context remains underexplored. To address these gaps, we propose HAFix (History-Augmented LLMs on Bug Fixing), a novel approach that leverages seven individual historical heuristics associated with bugs and aggregates the results of these heuristics (HAFix-Agg) to enhance LLMs' bug-fixing capabilities. To empirically evaluate HAFix, we employ three Code LLMs (i.e., Code Llama, DeepSeek-Coder, and DeepSeek-Coder-V2-Lite) on 51 single-line Python bugs from BugsInPy and 116 single-line Java bugs from Defects4J. Our evaluation demonstrates that multiple HAFix heuristics achieve statistically significant improvements over a non-historical baseline inspired by GitHub Copilot. Furthermore, the aggregated variant, HAFix-Agg, achieves substantial improvements by combining the complementary strengths of individual heuristics, increasing bug-fixing rates by an average of 45.05% on BugsInPy and 49.92% on Defects4J relative to the corresponding baseline. Moreover, within the context of historical heuristics, we identify the Instruction prompt style as the most effective template for LLM bug fixing, outperforming both InstructionLabel and InstructionMask. Finally, we evaluate the cost of HAFix in terms of inference time and token usage, and provide a pragmatic analysis of the trade-off between cost and bug-fixing performance, offering valuable insights for the practical deployment of our approach in real-world scenarios.