Recent advances in generative AI have led to the widespread adoption of large language models (LLMs) in software engineering, addressing numerous long-standing challenges. However, a comprehensive study of LLM capabilities in software vulnerability detection (SVD), a crucial aspect of software security, is currently lacking. Existing research focuses primarily on evaluating LLMs with C/C++ datasets and typically explores only one or two of the three strategies of prompt engineering, instruction tuning, and sequence classification fine-tuning for open-source LLMs. Consequently, there is a significant knowledge gap regarding the effectiveness of diverse LLMs in detecting vulnerabilities across various programming languages. To address this gap, we present a comprehensive empirical study evaluating the performance of LLMs on the SVD task. We compile a dataset comprising 8,260 vulnerable functions in Python, 7,505 in Java, and 28,983 in JavaScript. We assess five open-source LLMs using multiple approaches, including prompt engineering, instruction tuning, and sequence classification fine-tuning, and benchmark them against five fine-tuned small language models and two open-source static application security testing tools. Furthermore, we explore two avenues to improve LLM performance on SVD: a) a data perspective, retraining models on downsampled, balanced datasets; and b) a model perspective, investigating ensemble learning methods that combine predictions from multiple LLMs. Our experiments demonstrate that SVD remains a challenging task for LLMs. This study provides a thorough understanding of the role of LLMs in SVD and offers practical insights for future work on leveraging generative AI to strengthen software security practices.