We present MultiVer, a zero-shot multi-agent system for vulnerability detection that achieves state-of-the-art recall without fine-tuning. A four-agent ensemble (security, correctness, performance, style) with union voting achieves 82.7% recall on PyVul, exceeding fine-tuned GPT-3.5 (81.3%) by 1.4 percentage points, which makes it the first zero-shot system to surpass fine-tuned performance on this benchmark. On SecurityEval, the same architecture achieves a 91.7% detection rate, matching specialized systems. The recall improvement comes at a cost in precision: 48.8% versus 63.9% for fine-tuned baselines, yielding an F1 score of 61.4%. Ablation experiments isolate component contributions: the multi-agent ensemble adds 17 percentage points of recall over single-agent security analysis. These results demonstrate that for security applications where false negatives are costlier than false positives, zero-shot multi-agent ensembles can match or exceed fine-tuned models on recall, the metric that matters most in that setting.
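To make the union-voting rule and the reported F1 figure concrete, here is a minimal sketch in Python. It is illustrative only and not the MultiVer implementation: the `Agent` type, the `union_vote` and `f1_score` helpers, and the keyword-based toy agents are all assumptions introduced here. In the actual system each agent would presumably be a zero-shot LLM prompt, and how a non-security agent's output is mapped to a vulnerability vote is not specified in the abstract.

```python
from typing import Callable, Dict

# Hypothetical agent type: takes source code, returns True if the agent flags it.
Agent = Callable[[str], bool]


def union_vote(code: str, agents: Dict[str, Agent]) -> bool:
    """Union voting: the sample is flagged if ANY agent flags it.

    This maximizes recall at the expense of precision, because a single
    false alarm from any agent is enough to produce a positive.
    """
    return any(agent(code) for agent in agents.values())


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy stand-ins for the four agents (keyword heuristics, NOT LLM calls).
toy_agents: Dict[str, Agent] = {
    "security": lambda code: "eval(" in code,
    "correctness": lambda code: "except:" in code,
    "performance": lambda code: False,
    "style": lambda code: False,
}

flagged = union_vote("result = eval(user_input)", toy_agents)
clean = union_vote("result = 1 + 1", toy_agents)

# Consistency check on the abstract's numbers: precision 48.8%, recall 82.7%.
reported_f1 = f1_score(0.488, 0.827)
print(flagged, clean, round(reported_f1, 3))
```

With the abstract's precision and recall, the harmonic mean comes out at roughly 0.614, consistent with the reported 61.4% F1. The union rule also shows why precision drops: adding agents can only add positives, never remove them.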