The rapid advancement of Large Language Models (LLMs) presents new opportunities for automated software vulnerability detection, a crucial task in securing modern codebases. This paper reports a comparative study of the effectiveness of LLM-based techniques for detecting software vulnerabilities. The study evaluates three approaches: Retrieval-Augmented Generation (RAG), Supervised Fine-Tuning (SFT), and a Dual-Agent LLM framework, each compared against a baseline LLM. A curated dataset was compiled from Big-Vul and real-world GitHub code repositories, focusing on five critical Common Weakness Enumeration (CWE) categories: CWE-119, CWE-399, CWE-264, CWE-20, and CWE-200. Our RAG approach, which integrates external domain knowledge from the internet and the MITRE CWE database, achieved the highest overall accuracy (0.86) and F1 score (0.85), highlighting the value of contextual augmentation. Our SFT approach, implemented with parameter-efficient QLoRA adapters, also demonstrated strong performance. Our Dual-Agent system, an architecture in which a secondary agent audits and refines the output of the first, showed promise in improving reasoning transparency and mitigating errors while reducing resource overhead. These results emphasize that incorporating a domain-expertise mechanism significantly strengthens the practical applicability of LLMs in real-world vulnerability detection tasks.