The automated program repair field has attracted substantial interest over the years, but despite significant research efforts, creating a system that works well for complex semantic bugs such as security vulnerabilities has proven difficult. A promising direction for addressing this challenge is to leverage large language models (LLMs), which are increasingly used to solve various programming tasks. In this paper, we investigate the effectiveness of LLMs for solving the code-repair task. We show that the task is difficult because it requires the model to learn long-range code relationships, which inherently relies on extensive amounts of training data. At the same time, creating a large, clean dataset of complex program bugs and their corresponding fixes is non-trivial. We propose a new approach for querying and fine-tuning LLMs that addresses these challenges. The idea is to use program analysis to focus the LLM's attention mechanism on the portions of code needed to perform the fix, drastically reducing the amount of required training data. Concretely, for both training and inference, rather than feeding the entire program to the LLM, we reduce its code to a much shorter snippet that contains the reported defect together with the necessary context, and use that instead. Our evaluation shows that this code reduction approach substantially improves available models such as GPT-4 with few-shot learning, as well as fine-tuned models. To train and evaluate our system, we created a comprehensive code-fixing dataset by extensively labeling 156 bug patterns (including 40 security rules) that require complex interprocedural dataflow analysis to discover. Our best system, based on Mixtral-8x7B, removes more than 80% of the reported defects while exactly matching the human fix in between 10% and 50% of cases, outperforming baselines based on GPT-3.5 and GPT-4 as well as window-based models such as TFix.
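To make the code reduction idea concrete, the following is a minimal sketch rather than the system described above. It assumes a toy, intraprocedural backward slice over Python source using the standard `ast` module, whereas the abstract refers to a program analysis with interprocedural dataflow. The function names `names_in` and `reduce_code` and the example program are hypothetical and chosen only for illustration.

```python
import ast


def names_in(node, ctx_type):
    """Collect identifiers that `node` reads (ast.Load) or writes (ast.Store)."""
    return {
        n.id
        for n in ast.walk(node)
        if isinstance(n, ast.Name) and isinstance(n.ctx, ctx_type)
    }


def reduce_code(source: str, defect_line: int) -> str:
    """Toy reduction: keep the defect statement plus the earlier top-level
    statements of the enclosing function that (transitively) assign the
    names it reads. Everything else in the function is dropped."""
    tree = ast.parse(source)

    # Find a function whose line range contains the reported defect.
    for func in ast.walk(tree):
        if (isinstance(func, ast.FunctionDef)
                and func.lineno <= defect_line <= func.end_lineno):
            break
    else:
        raise ValueError("defect line is not inside a function")

    body = func.body
    idx = next(
        (i for i, s in enumerate(body)
         if s.lineno <= defect_line <= s.end_lineno),
        None,
    )
    if idx is None:
        raise ValueError("defect line is not inside a statement")

    # Walk backwards, keeping statements that define a name we still need.
    keep = {idx}
    needed = names_in(body[idx], ast.Load)
    for i in range(idx - 1, -1, -1):
        if names_in(body[i], ast.Store) & needed:
            keep.add(i)
            needed |= names_in(body[i], ast.Load)

    lines = source.splitlines()
    # Function header: everything from `def` up to the first body statement.
    out = lines[func.lineno - 1: body[0].lineno - 1]
    for i in sorted(keep):
        out.extend(lines[body[i].lineno - 1: body[i].end_lineno])
    return "\n".join(out)


# Hypothetical input: the defect is reported on line 7 (the `return`).
example = """\
def handler(request, db):
    user = request.args["user"]
    log = make_logger()
    log.info("start")
    query = "SELECT * FROM t WHERE u = " + user
    timeout = 30
    return db.execute(query)
"""

reduced = reduce_code(example, defect_line=7)
print(reduced)
```

In this sketch the logging and timeout statements are dropped because the flagged statement does not depend on them, so the snippet passed to a model would contain only the defect and the assignments that feed it. A realistic reduction would also have to follow calls across functions and handle control flow, which this toy version does not attempt.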