A Comprehensive Study of the Capabilities of Large Language Models for Vulnerability Detection

Large Language Models (LLMs) have demonstrated great potential for code generation and other software engineering tasks. Vulnerability detection is of crucial importance to maintaining the security, integrity, and trustworthiness of software systems. Precise vulnerability detection requires reasoning about the code, making it a good case study for exploring the limits of LLMs' reasoning capabilities. Although recent work has applied LLMs to vulnerability detection using generic prompting techniques, their full capabilities for this task and the types of errors they make when explaining identified vulnerabilities remain unclear. In this paper, we surveyed eleven LLMs that are state-of-the-art in code generation and commonly used as coding assistants, and evaluated their capabilities for vulnerability detection. We systematically searched for the best-performing prompts, incorporating techniques such as in-context learning and chain-of-thought, and proposed three of our own prompting methods. Our results show that while our prompting methods improved the models' performance, LLMs generally struggled with vulnerability detection. They reported 0.5-0.63 Balanced Accuracy and failed to distinguish between buggy and fixed versions of programs in 76% of cases on average. By comprehensively analyzing and categorizing 287 instances of model reasoning, we found that 57% of LLM responses contained errors, and the models frequently predicted incorrect locations of buggy code and misidentified bug types. LLMs only correctly localized 6 out of 27 bugs in DbgBench, and these 6 bugs were predicted correctly by 70-100% of human participants. These findings suggest that despite their potential for other tasks, LLMs may fail to properly comprehend critical code structures and security-related concepts. Our data and code are available at https://figshare.com/s/78fe02e56e09ec49300b.
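To make the two headline metrics concrete, here is a minimal sketch of how they could be computed. It assumes binary labels (`True` = vulnerable), takes Balanced Accuracy to be the mean of recall on the vulnerable and non-vulnerable classes (the standard definition), and counts a buggy/fixed pair as "not distinguished" when both versions receive the same prediction. The function names and the pair criterion are illustrative assumptions, not taken from the paper's released code.

```python
def balanced_accuracy(labels, predictions):
    """Mean of recall on the vulnerable class and recall on the non-vulnerable class."""
    pairs = list(zip(labels, predictions))
    true_pos = sum(1 for y, p in pairs if y and p)
    true_neg = sum(1 for y, p in pairs if not y and not p)
    pos = sum(1 for y in labels if y)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    return 0.5 * (true_pos / pos + true_neg / neg)


def undistinguished_pair_rate(pair_predictions):
    """Fraction of (buggy, fixed) pairs given the same prediction.

    Assumption: a pair counts as 'not distinguished' when the model
    predicts the same label for the buggy and the fixed version.
    """
    same = sum(1 for buggy_pred, fixed_pred in pair_predictions
               if buggy_pred == fixed_pred)
    return same / len(pair_predictions)


# A model that flags every program as vulnerable (hypothetical data).
labels = [True, False, True, False]
always_vulnerable = [True, True, True, True]
print(balanced_accuracy(labels, always_vulnerable))

# Predictions for four (buggy, fixed) pairs (hypothetical data).
pair_preds = [(True, True), (True, False), (False, False), (True, True)]
print(undistinguished_pair_rate(pair_preds))
```

Under this definition, a model that labels every program the same way scores exactly 0.5, so the lower end of the reported 0.5-0.63 range corresponds to no better than constant guessing.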
