DeepCode AI Fix: Fixing Security Vulnerabilities with Large Language Models

The field of automated program repair has attracted substantial interest over the years, but despite significant research effort, building a system that works well for complex semantic bugs such as security vulnerabilities has proven difficult. A promising direction is to leverage large language models (LLMs), which are increasingly used to solve various programming tasks. In this paper, we investigate the effectiveness of LLMs for solving the code-repair task. We show that the task is difficult because it requires the model to learn long-range code relationships, which inherently relies on extensive amounts of training data. At the same time, creating a large, clean dataset of complex program bugs and their corresponding fixes is non-trivial. We propose a technique to address these challenges with a new approach for querying and fine-tuning LLMs. The idea is to use program analysis to limit the LLM's attention mechanism to the portions of code needed to perform the fix, drastically reducing the amount of required training data. Concretely, for both training and inference, rather than feeding the entire program to the LLM, we reduce its code to a much shorter snippet that contains the reported defect together with the necessary context, and use that instead. Our evaluation shows that this code reduction approach substantially improves both available models such as GPT-4 used with few-shot learning and fine-tuned models. To train and evaluate our system, we created a comprehensive code fixing dataset by extensively labeling 156 bug patterns (including 40 security rules), which require complex interprocedural dataflow analysis to discover. Our best system, based on Mixtral-8x7B, can remove more than 80% of the reported defects while exactly matching the human fix in 10% to 50% of cases, outperforming baselines based on GPT-3.5 and GPT-4 as well as window-based models like TFix.
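To make the code reduction idea more concrete, the following is a minimal sketch and not the paper's implementation. It assumes a static analyzer has already reported which lines take part in the defect and its context; the function `reduce_code`, the `relevant_lines` parameter, the `# ...` placeholder, and the sample program are all hypothetical and chosen only for illustration.

```python
from typing import Iterable, List

ELISION = "# ..."  # hypothetical placeholder marking removed code


def reduce_code(source: str, relevant_lines: Iterable[int]) -> str:
    """Keep only the 1-indexed lines in relevant_lines.

    Each run of removed lines is collapsed into a single placeholder,
    so the model sees a short snippet instead of the whole program.
    """
    keep = set(relevant_lines)
    reduced: List[str] = []
    in_gap = False  # True while we are inside a run of removed lines
    for number, line in enumerate(source.splitlines(), start=1):
        if number in keep:
            reduced.append(line)
            in_gap = False
        elif not in_gap:
            reduced.append(ELISION)
            in_gap = True
    return "\n".join(reduced)


# Hypothetical program with a path traversal defect in handler().
program = """import os
def helper():
    return 42
def handler(request):
    name = request.args["file"]
    log(name)
    path = "/data/" + name
    return open(path).read()
"""

# Assumed analyzer output: the lines on the dataflow path from the
# untrusted input (line 5) to the file access (line 8).
snippet = reduce_code(program, relevant_lines=[4, 5, 7, 8])
print(snippet)
```

In this sketch the unrelated import, the helper function, and the logging call are elided, while the lines that carry the untrusted value to the file access remain. A real system would derive the relevant lines from interprocedural dataflow analysis rather than from a hand-written list.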

