FORGE: Multi-Agent Graduated Exploitation and Detection Engineering

From arXiv. 18 pages, 4 figures, 3 tables. Accepted at the AgentCy Workshop at the 21st International Conference on Availability, Reliability and Security (ARES 2026). Keywords: vulnerability assessment, multi-agent systems, exploit generation, detection engineering, risk prioritization.

Vulnerability disclosure volumes now far exceed organizational assessment capacity, yet three adjacent research communities (proof-of-concept generation, vulnerability prioritization, and detection rule engineering) operate largely in isolation. Existing automated exploit generation systems report binary pass/fail outcomes, discarding partial progress and producing no signal for the other two communities. This paper presents FORGE, a multi-agent system that bridges these three silos through graduated exploitation depth. Five specialized agents (Intel, Generator, Planner, Exploit, and Detector) execute in a fixed pipeline that (1) generates targeted vulnerable applications from CVE metadata, (2) conducts coached, multi-turn exploitation assessed by an LLM-primary oracle on a four-level taxonomy (L0: no evidence through L3: full compromise), and (3) produces Sigma and Snort detection rules grounded in OpenTelemetry exploitation traces. Graduated depth is the bridging mechanism: deeper exploitation yields richer behavioral traces for detection engineering, while depth data across scoring bands provides ground truth for prioritization validation. A tiered knowledge architecture accumulates intelligence across assessments, transferring build and exploitation experience to subsequent CVEs. Evaluation on 603 CVEs from the CVE-GENIE dataset achieves 67.8% end-to-end L1+ exploitation at USD 1.50 per CVE across eight languages and 187 CWE types. Exploitation rates remain near 68% regardless of EPSS or CVSS band, indicating that pattern-level reachability is orthogonal to metadata-based prioritization. Detection rules from L2+ exploitation achieve significantly higher span-normalized grounding than L1-derived rules (p=0.035), and 93.4% of generated Snort rules produce zero false positives against a synthetic benign corpus.
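
As an illustration of how graduated depth can feed prioritization validation, the following minimal sketch encodes the four-level taxonomy and computes the share of CVEs reaching a given depth within each scoring band. It is not FORGE's implementation: the `Depth` enum, the `rate_at_or_above` helper, the band labels, and the toy data are all assumptions made for this example, and the abstract names only L0 (no evidence) and L3 (full compromise), so L1 and L2 are left as unnamed intermediate levels.

```python
from collections import defaultdict
from enum import IntEnum


class Depth(IntEnum):
    """Four-level exploitation depth; only L0 and L3 are named in the abstract."""
    L0 = 0  # no evidence
    L1 = 1  # intermediate level (not defined in the abstract)
    L2 = 2  # intermediate level (not defined in the abstract)
    L3 = 3  # full compromise


def rate_at_or_above(results, threshold=Depth.L1):
    """Return {band: fraction of CVEs whose depth is >= threshold}.

    `results` is an iterable of (band_label, Depth) pairs. Band labels are
    hypothetical placeholders for EPSS or CVSS scoring bands.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for band, depth in results:
        totals[band] += 1
        if depth >= threshold:
            hits[band] += 1
    return {band: hits[band] / totals[band] for band in totals}


# Made-up toy data for illustration only; these are not results from the paper.
toy = [
    ("low", Depth.L0), ("low", Depth.L2), ("low", Depth.L1),
    ("high", Depth.L3), ("high", Depth.L0), ("high", Depth.L1),
]
rates = rate_at_or_above(toy)                # L1+ rate per band
deep_rates = rate_at_or_above(toy, Depth.L2)  # L2+ rate per band
```

Because the levels are ordered, one comparison against a threshold covers both the "L1+" and "L2+" cuts that the abstract reports. Similar rates across bands, as in this toy data, are the pattern the abstract describes as exploitation depth being orthogonal to metadata-based scores.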
