Large Language Models (LLMs) are increasingly adopted for vulnerability detection, yet their reasoning remains fundamentally unsound. We identify a root cause shared by the two major mitigation paradigms, agent-based debate and retrieval augmentation: both reason in an ungrounded deliberative space that lacks a bounded, hypothesis-specific evidence base. Without such grounding, agents fabricate cross-function dependencies, and retrieval heuristics supply generic knowledge decoupled from the repository's data-flow topology. The resulting conclusions are therefore driven by rhetorical persuasiveness rather than verifiable facts. To ground this deliberation, we present AEGIS, a novel multi-agent framework that shifts detection from ungrounded speculation to forensic verification over a closed factual substrate. Guided by a "From Clue to Verdict" philosophy, AEGIS first identifies suspicious code anomalies (clues), then dynamically reconstructs per-variable dependency chains for each clue via on-demand slicing over a repository-level Code Property Graph. Within this closed evidence boundary, a Verifier Agent constructs competing dialectical arguments for and against exploitability, while an independent Audit Agent checks every claim against the trace and exercises veto power to prevent hallucinated verdicts. Evaluation on the rigorous PrimeVul dataset shows that AEGIS establishes a new state of the art, achieving 122 Pair-wise Correct Predictions. To our knowledge, this is the first approach to surpass 100 on this benchmark. AEGIS reduces the false positive rate by up to 54.40% compared to leading baselines, at an average cost of $0.09 per sample and without any task-specific training.
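To make the "closed evidence boundary" idea concrete, the following is a minimal sketch and not the AEGIS implementation. It assumes a toy dependency map in place of a repository-level Code Property Graph, and it reduces the Audit Agent's check to a set-containment test. The names `DEPENDS_ON`, `backward_slice`, `Claim`, and `audit` are hypothetical and chosen only for illustration.

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical toy dependency map: each variable lists the variables it is
# computed from. A real system would query a Code Property Graph instead.
DEPENDS_ON = {
    "buf": ["len", "src"],
    "len": ["hdr"],
    "hdr": ["packet"],
    "src": ["packet"],
    "packet": [],
    "log_level": [],  # unrelated to the clue, so it must stay outside the slice
}


def backward_slice(graph, clue_var):
    """Collect every variable the clue variable transitively depends on."""
    seen = {clue_var}
    queue = deque([clue_var])
    while queue:
        var = queue.popleft()
        for dep in graph.get(var, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


@dataclass(frozen=True)
class Claim:
    text: str
    cited_vars: frozenset  # variables the claim relies on


def audit(claims, evidence):
    """Veto the verdict if any claim cites a variable outside the evidence."""
    rejected = [c for c in claims if not c.cited_vars <= evidence]
    return ("vetoed", rejected) if rejected else ("accepted", [])


# Usage: slice for the clue variable, then audit two sets of claims against it.
evidence = backward_slice(DEPENDS_ON, "buf")
grounded = [Claim("len is derived from the packet header", frozenset({"len", "hdr"}))]
ungrounded = [Claim("log_level controls the copy size", frozenset({"log_level"}))]
grounded_status, _ = audit(grounded, evidence)
ungrounded_status, ungrounded_rejected = audit(ungrounded, evidence)
```

In this simplification, a claim is rejected purely because it cites something outside the slice. The audit described in the abstract scrutinizes claims against the trace itself, which is presumably a richer check than set membership.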