Recent advances in generative AI have led to the widespread adoption of large language models (LLMs) in software engineering, addressing numerous long-standing challenges. However, a comprehensive study of LLM capabilities in software vulnerability detection (SVD), a crucial aspect of software security, is currently lacking. Existing research focuses primarily on evaluating LLMs with C/C++ datasets and typically explores only one or two of the three strategies of prompt engineering, instruction tuning, and sequence classification fine-tuning for open-source LLMs. Consequently, there is a significant knowledge gap regarding the effectiveness of diverse LLMs in detecting vulnerabilities across various programming languages. To address this gap, we present a comprehensive empirical study evaluating the performance of LLMs on the SVD task. We compile a dataset comprising 8,260 vulnerable functions in Python, 7,505 in Java, and 28,983 in JavaScript. We assess five open-source LLMs using multiple approaches, including prompt engineering, instruction tuning, and sequence classification fine-tuning, and benchmark them against five fine-tuned small language models and two open-source static application security testing tools. Furthermore, we explore two avenues to improve LLM performance on SVD: a) a data perspective, retraining models on downsampled, balanced datasets; and b) a model perspective, investigating ensemble learning methods that combine predictions from multiple LLMs. Our experiments demonstrate that SVD remains a challenging task for LLMs. This study provides a thorough understanding of the role of LLMs in SVD and offers practical insights for future work on leveraging generative AI to strengthen software security practices.