Large Language Models (LLMs) have demonstrated great potential for code generation and other software engineering tasks. Vulnerability detection is crucial to maintaining the security, integrity, and trustworthiness of software systems. Precise vulnerability detection requires reasoning about the code, which makes it a good case study for exploring the limits of LLMs' reasoning capabilities. Although recent work has applied LLMs to vulnerability detection using generic prompting techniques, their full capabilities for this task, and the types of errors they make when explaining the vulnerabilities they identify, remain unclear. In this paper, we surveyed eleven LLMs that are state-of-the-art in code generation and commonly used as coding assistants, and evaluated their capabilities for vulnerability detection. We systematically searched for the best-performing prompts, incorporating techniques such as in-context learning and chain-of-thought, and proposed three prompting methods of our own. Our results show that although our prompting methods improved the models' performance, LLMs generally struggled with vulnerability detection. The models achieved a Balanced Accuracy of only 0.5-0.63, where 0.5 is equivalent to random guessing, and failed to distinguish between buggy and fixed versions of programs in 76% of cases on average. By comprehensively analyzing and categorizing 287 instances of model reasoning, we found that 57% of LLM responses contained errors, and that the models frequently predicted incorrect locations of buggy code and misidentified bug types. LLMs correctly localized only 6 out of 27 bugs in DbgBench, and these 6 bugs were also predicted correctly by 70-100% of human participants. These findings suggest that despite their potential for other tasks, LLMs may fail to properly comprehend critical code structures and security-related concepts. Our data and code are available at https://figshare.com/s/78fe02e56e09ec49300b.
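To make the Balanced Accuracy figures above easier to interpret, the following is a minimal sketch of the standard binary definition of the metric (the mean of recall on the vulnerable class and recall on the non-vulnerable class). It is not the authors' evaluation code; the function name `balanced_accuracy` and the toy labels are assumptions introduced purely for illustration.

```python
def balanced_accuracy(labels, predictions):
    """Mean of per-class recall for binary labels (1 = vulnerable, 0 = not vulnerable).

    Illustrative helper only; the name and the toy data below are hypothetical.
    """
    pairs = list(zip(labels, predictions))
    true_pos = sum(1 for y, p in pairs if y == 1 and p == 1)
    true_neg = sum(1 for y, p in pairs if y == 0 and p == 0)
    num_pos = sum(1 for y in labels if y == 1)
    num_neg = sum(1 for y in labels if y == 0)
    if num_pos == 0 or num_neg == 0:
        raise ValueError("both classes must be present in labels")
    return 0.5 * (true_pos / num_pos + true_neg / num_neg)


# Toy data: four vulnerable and four non-vulnerable functions.
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# A detector that flags everything as vulnerable scores exactly 0.5.
always_vulnerable = [1] * len(labels)
print(balanced_accuracy(labels, always_vulnerable))  # 0.5

# A detector with 3/4 recall on vulnerable and 2/4 on non-vulnerable code.
mixed = [1, 1, 1, 0, 1, 1, 0, 0]
print(balanced_accuracy(labels, mixed))  # 0.625
```

Under this definition, a model that labels every input as vulnerable (or every input as safe) scores 0.5, so the reported range of 0.5-0.63 sits close to the trivial baseline.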