Most vulnerability detection studies focus on datasets of vulnerabilities in C/C++ code, offering limited language diversity. Thus, the effectiveness of deep learning methods, including large language models (LLMs), in detecting software vulnerabilities beyond these languages remains largely unexplored. In this paper, we evaluate the effectiveness of LLMs in detecting and classifying Common Weakness Enumerations (CWEs) using different prompt and role strategies. Our experimental study targets six state-of-the-art pre-trained LLMs (GPT-3.5-Turbo, GPT-4 Turbo, GPT-4o, CodeLLama-7B, CodeLLama-13B, and Gemini 1.5 Pro) and five programming languages: Python, C, C++, Java, and JavaScript. We compiled a multi-language vulnerability dataset from different sources to ensure representativeness. Our results show that GPT-4o achieves the highest vulnerability detection and CWE classification scores in a few-shot setting. Beyond the quantitative results of our study, we developed a library called CODEGUARDIAN, integrated with VSCode, which enables developers to perform LLM-assisted real-time vulnerability analysis in real-world security scenarios. We evaluated CODEGUARDIAN through a user study involving 22 industry developers. Our study showed that, by using CODEGUARDIAN, developers detect vulnerabilities more accurately and faster.
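To make the few-shot prompting setup concrete, the sketch below shows one plausible way such a prompt could be assembled; the example snippets, CWE labels, and the `build_prompt` helper are illustrative assumptions, not the paper's actual prompts or dataset entries.

```python
# Hypothetical sketch of few-shot, role-based prompt construction for
# LLM vulnerability detection and CWE classification. The examples and
# labels below are illustrative, not taken from the study's dataset.
FEW_SHOT_EXAMPLES = [
    {
        "code": 'query = "SELECT * FROM users WHERE id = " + user_id',
        "label": "VULNERABLE (CWE-89: SQL Injection)",
    },
    {
        "code": 'cursor.execute("SELECT * FROM users WHERE id = ?", (user_id,))',
        "label": "NOT VULNERABLE",
    },
]

def build_prompt(snippet: str, role: str = "security expert") -> str:
    """Assemble a role-conditioned few-shot prompt asking an LLM to
    classify a code snippet and name the matching CWE."""
    parts = [
        f"You are a {role}. Classify each code snippet as VULNERABLE "
        "(with its CWE ID) or NOT VULNERABLE."
    ]
    # Each labeled example demonstrates the expected answer format.
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(f"Code: {ex['code']}\nAnswer: {ex['label']}")
    # The snippet under analysis is left unlabeled for the model to complete.
    parts.append(f"Code: {snippet}\nAnswer:")
    return "\n\n".join(parts)

prompt = build_prompt("os.system('rm -rf ' + user_input)")
```

The resulting string would then be sent to the model under test (e.g. via the OpenAI or Gemini chat API); varying the `role` argument corresponds to the role strategies the study compares.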