While automated vulnerability detection techniques have made promising progress in detecting security vulnerabilities, their scalability and applicability remain challenging. The remarkable performance of Large Language Models (LLMs), such as GPT-4 and CodeLlama, on code-related tasks has prompted recent work to explore whether LLMs can be used to detect vulnerabilities. In this paper, we perform a more comprehensive study by concurrently examining a larger number of datasets, languages, and LLMs, and by qualitatively evaluating performance across prompts and vulnerability classes while addressing the shortcomings of existing tools. Concretely, we evaluate the effectiveness of 16 pre-trained LLMs on 5,000 code samples drawn from five diverse security datasets. These balanced datasets encompass both synthetic and real-world projects in Java and C/C++ and cover 25 distinct vulnerability classes. Overall, LLMs across all scales and families show only modest effectiveness in detecting vulnerabilities, obtaining an average accuracy of 62.8% and an F1 score of 0.71 across datasets. They are significantly better at detecting vulnerabilities that require only intra-procedural analysis, such as OS Command Injection and NULL Pointer Dereference. Moreover, on these vulnerabilities they achieve higher accuracy than popular static analysis tools, such as CodeQL. We find that advanced prompting strategies involving step-by-step analysis significantly improve the performance of LLMs on real-world datasets in terms of F1 score (by up to 0.18 on average). Interestingly, we observe that LLMs show promising abilities at performing parts of the analysis correctly, such as identifying vulnerability-related specifications and leveraging natural-language information to understand code behavior (e.g., to check whether code is sanitized). We expect our insights to guide future work on LLM-augmented vulnerability detection systems.