Vulnerability detection tools are widely adopted in software projects, yet they often overwhelm maintainers with false positives and non-actionable reports. Automated exploitation systems can help validate these reports; however, existing approaches typically operate in isolation from detection pipelines, failing to leverage readily available metadata such as vulnerability type and source-code location. In this paper, we investigate how reported security vulnerabilities can be assessed in a realistic grey-box exploitation setting that leverages minimal vulnerability metadata, specifically a CWE classification and a vulnerable code location. We introduce Agentic eXploit Engine (AXE), a multi-agent framework for Web application exploitation that maps lightweight detection metadata to concrete exploits through decoupled planning, code exploration, and dynamic execution feedback. Evaluated on the CVE-Bench dataset, AXE achieves a 30% exploitation success rate, a 3x improvement over state-of-the-art black-box baselines. Even in a single-agent configuration, grey-box metadata yields a 1.75x performance gain. Systematic error analysis shows that most failed attempts arise from specific reasoning gaps, including misinterpreted vulnerability semantics and unmet execution preconditions. For successful exploits, AXE produces actionable, reproducible proof-of-concept artifacts, demonstrating its utility in streamlining Web vulnerability triage and remediation. We further evaluate AXE's generalizability through a case study on a recent real-world vulnerability not included in CVE-Bench.