Large Language Models (LLMs) struggle to automate real-world vulnerability detection for two key reasons: the heterogeneity of vulnerability patterns undermines the effectiveness of a single unified model, and manually engineering prompts for the vast number of weakness categories does not scale. To address these challenges, we propose \textbf{MulVul}, a retrieval-augmented multi-agent framework designed for precise, broad-coverage vulnerability detection. MulVul adopts a coarse-to-fine strategy: a \emph{Router} agent first predicts the top-$k$ coarse categories and then forwards the input to specialized \emph{Detector} agents, which identify the exact vulnerability types. Both agents are equipped with retrieval tools that actively gather evidence from vulnerability knowledge bases to mitigate hallucination. Crucially, to automate the generation of specialized prompts, we design \emph{Cross-Model Prompt Evolution}, a prompt-optimization mechanism in which a generator LLM iteratively refines candidate prompts while a distinct executor LLM validates their effectiveness; this decoupling mitigates the self-correction bias inherent in single-model optimization. Evaluated on 130 CWE types, MulVul achieves a Macro-F1 of 34.79\%, outperforming the best baseline by 41.5\%. Ablation studies confirm the contribution of Cross-Model Prompt Evolution, which improves performance by 51.6\% over manually written prompts by effectively handling diverse vulnerability patterns.
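The coarse-to-fine Router/Detector flow described above can be sketched in a few lines. This is a minimal illustration with stand-in agents: the keyword-based routing, the toy knowledge base, the category names, and the confidence scoring are all assumptions for exposition, not the paper's actual LLM agents or retrieval tools.

```python
def retrieve_evidence(kb: dict, code: str) -> list:
    """Stand-in retrieval tool: return knowledge-base entries whose key
    appears in the code snippet (a real system would use dense retrieval)."""
    return [note for key, note in kb.items() if key in code]

def router(code: str, k: int = 2) -> list:
    """Stand-in Router agent: score coarse categories (here by keyword
    counts, purely illustrative) and return the top-k."""
    scores = {
        "memory": code.count("strcpy") + code.count("malloc"),
        "injection": code.count("exec") + code.count("query"),
        "crypto": code.count("md5") + code.count("des"),
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

def detector(category: str, code: str, kb: dict) -> tuple:
    """Stand-in Detector agent: refine a coarse category into an exact
    CWE type, consulting retrieved evidence to ground the decision."""
    evidence = retrieve_evidence(kb, code)
    cwe_map = {"memory": "CWE-787", "injection": "CWE-89", "crypto": "CWE-327"}
    confidence = 1.0 if evidence else 0.5  # evidence-backed answers rank higher
    return cwe_map[category], confidence

def mulvul_detect(code: str, kb: dict, k: int = 2) -> str:
    """Coarse-to-fine pipeline: Router narrows the search space to k
    categories, then the matching Detectors produce fine-grained labels."""
    candidates = [detector(cat, code, kb) for cat in router(code, k)]
    return max(candidates, key=lambda pair: pair[1])[0]

kb = {"strcpy": "strcpy lacks bounds checking (risk of out-of-bounds write)."}
result = mulvul_detect("strcpy(buf, input);", kb)  # coarse: memory -> fine: CWE-787
```

The point of the two-stage design is visible even in this toy: the Router prunes most categories before any fine-grained reasoning happens, so each Detector only needs to discriminate within its own coarse family.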
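Cross-Model Prompt Evolution can likewise be sketched with toy stand-ins. Here `generator_llm` deterministically appends refinement hints in place of a real LLM proposing rewrites, and `executor_llm` is a separate toy model whose validation accuracy selects the surviving prompt; all names, hints, and the two-sample validation set are illustrative assumptions.

```python
from itertools import cycle

# Illustrative refinement hints; a real generator LLM would propose
# free-form prompt rewrites rather than cycle through a fixed list.
HINTS = cycle(["Check bounds carefully.", "Cite retrieved evidence.",
               "List suspicious API calls."])

def generator_llm(prompt: str) -> str:
    """Stand-in generator: propose a refined candidate prompt."""
    return prompt + " " + next(HINTS)

def executor_llm(prompt: str, sample: str) -> str:
    """Stand-in executor: its prediction depends on the prompt it receives."""
    if "bounds" in prompt.lower() and "strcpy" in sample:
        return "CWE-787"
    return "safe"

def evaluate(prompt: str, val_set: list) -> float:
    """Validation accuracy of the *executor* under a candidate prompt."""
    return sum(executor_llm(prompt, x) == y for x, y in val_set) / len(val_set)

def evolve(seed_prompt: str, val_set: list, rounds: int = 5):
    """Generator proposes, a distinct executor validates, the best prompt
    survives each round. Decoupling the two roles keeps one model from
    grading its own refinements (the self-correction bias)."""
    best, best_score = seed_prompt, evaluate(seed_prompt, val_set)
    for _ in range(rounds):
        candidate = generator_llm(best)
        score = evaluate(candidate, val_set)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

val = [("strcpy(buf, s);", "CWE-787"), ("return 0;", "safe")]
best_prompt, best_score = evolve("Classify the snippet's CWE.", val)
```

Because acceptance is gated on the executor's validation score rather than the generator's own judgment, a refinement only survives if it measurably helps the model that will actually run it.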