VulnAgent-R2: Evidence-Calibrated Multi-Agent Auditing for Repository-Level Vulnerability Detection

Software vulnerabilities often depend on cross-file data flow, build options, framework conventions, and runtime guards, so isolated function classifiers produce fragile and poorly calibrated warnings. Repository-level LLM agents can gather richer evidence, but prior variants under-specify reproducibility, verifier behavior, baseline fairness, and statistical uncertainty. We present VulnAgent-R2, a budget-aware agentic auditing framework with three additional reusable modules: counterfactual evidence reweighting, build-aware verification-plan synthesis, and a cost-risk Pareto scheduler. The system combines graph triage, bounded context optimization, role-specialized agents, sceptic counter-evidence, selective dynamic verification, and calibrated fusion. On Devign, Big-Vul, DiverseVul, and PrimeVul, VulnAgent-R2 obtains F1/AUROC of 0.798/0.895, 0.739/0.871, 0.700/0.842, and 0.385/0.781, respectively. On JITVul it reaches 0.606 F1, 0.529 Top-1, and 0.742 Top-3 localization, while using 38.3\% fewer online tokens than always-full multi-agent execution. Online time includes retrieval, LLM calls, CER scoring, verifier planning, compilation, and test execution, but excludes one-time shared indexing. Bootstrap tests show that the PrimeVul gain over VulnAgent-X is +0.038 F1, with 95\% CI [0.020, 0.055] and Holm-adjusted $p=0.009$. Treating vulnerability detection as calibrated evidence accumulation improves detection, localization, auditability, and cost control under the evaluated protocol, while remaining a prioritization aid rather than a replacement for manual review. Code is available at https://github.com/renweimeng/Vlun-Agent-X.
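The abstract reports a paired F1 gain with a bootstrap confidence interval and a Holm-adjusted p-value. The sketch below shows one common way such numbers can be computed; it is not the authors' evaluation code. The function names (`f1_score`, `paired_bootstrap_f1_gain`, `holm_adjust`), the percentile interval, the two-sided p-value estimate, and the resample count are all assumptions made for illustration.

```python
import random


def f1_score(y_true, y_pred):
    """Binary F1 for 0/1 labels; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def paired_bootstrap_f1_gain(y_true, pred_a, pred_b, n_resamples=2000, seed=0):
    """Paired bootstrap over examples for F1(A) - F1(B).

    Returns (observed gain, (lower, upper) 95% percentile interval,
    two-sided p-value estimate). Both systems are scored on the same
    resampled indices, which is what makes the comparison paired.
    """
    rng = random.Random(seed)
    n = len(y_true)
    observed = f1_score(y_true, pred_a) - f1_score(y_true, pred_b)
    deltas = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        pa = [pred_a[i] for i in idx]
        pb = [pred_b[i] for i in idx]
        deltas.append(f1_score(yt, pa) - f1_score(yt, pb))
    deltas.sort()
    lower = deltas[int(0.025 * n_resamples)]
    upper = deltas[min(n_resamples - 1, int(0.975 * n_resamples))]
    # Assumed p-value convention: twice the smaller tail mass around zero.
    frac_le_zero = sum(1 for d in deltas if d <= 0) / n_resamples
    frac_ge_zero = sum(1 for d in deltas if d >= 0) / n_resamples
    p_value = min(1.0, 2 * min(frac_le_zero, frac_ge_zero))
    return observed, (lower, upper), p_value


def holm_adjust(p_values):
    """Holm step-down adjustment; returns adjusted p-values in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[i] = running_max
    return adjusted


if __name__ == "__main__":
    # Toy data only: system A is perfect, system B predicts all negatives.
    labels = [1, 0] * 50
    system_a = list(labels)
    system_b = [0] * len(labels)
    gain, interval, p = paired_bootstrap_f1_gain(labels, system_a, system_b)
    print(gain, interval, p)
    print(holm_adjust([0.01, 0.04, 0.03]))
```

Resampling the same indices for both systems keeps the per-example pairing intact, so the interval reflects the variability of the difference rather than of each score separately. Holm adjustment is then applied across the family of comparisons (for example, one per dataset or baseline) to control the family-wise error rate.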
