Large Language Models (LLMs) are trained on corpora containing large amounts of program code, which greatly improves their code comprehension and generation capabilities. However, comprehensive studies of how well LLMs perform on program vulnerability detection, a more specialized code-related task, are still lacking. To address common challenges in vulnerability analysis, we introduce VulDetectBench, a new benchmark specifically designed to assess the vulnerability detection capabilities of LLMs. The benchmark comprehensively evaluates an LLM's ability to identify, classify, and locate vulnerabilities through five tasks of increasing difficulty. We evaluate 17 models (both open- and closed-source) and find that while existing models achieve over 80% accuracy on tasks related to vulnerability identification and classification, they still fall short on more detailed vulnerability analysis tasks, reaching less than 30% accuracy, which makes it difficult for them to provide useful auxiliary information for professional vulnerability mining. Our benchmark effectively evaluates the capabilities of various LLMs at different levels of the vulnerability detection task, providing a foundation for future research and improvement in this critical area of code security. VulDetectBench is publicly available at https://github.com/Sweetaroo/VulDetectBench.