PenForge: On-the-Fly Expert Agent Construction for Automated Penetration Testing

Penetration testing is essential for identifying vulnerabilities in web applications before real adversaries can exploit them. Recent work has explored automating this process with Large Language Model (LLM)-powered agents, but existing approaches rely either on a single generic agent that struggles in complex scenarios or on narrowly specialized agents that cannot adapt to diverse vulnerability types. We therefore introduce PenForge, a framework that dynamically constructs expert agents during testing rather than relying on agents prepared beforehand. By integrating automated reconnaissance of potential attack surfaces with agents instantiated on the fly for context-aware exploitation, PenForge achieves a 30.0% exploit success rate (12/40) on CVE-Bench in the particularly challenging zero-day setting, a threefold improvement over the state of the art. Our analysis also identifies three opportunities for future work: (1) supplying richer tool-usage knowledge to improve exploitation effectiveness; (2) extending benchmarks to include more vulnerabilities and attack types; and (3) fostering developer trust by incorporating explainable mechanisms and human review. Although this is an early-stage result, it suggests that on-the-fly agent construction is a promising step toward scalable and effective LLM-driven penetration testing.
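To make the idea of constructing agents at test time concrete, the following is a minimal Python sketch and not PenForge's actual implementation. It assumes reconnaissance yields a list of findings, each tagged with a suspected vulnerability class, and that one expert agent is configured per class. The names `Finding`, `ExpertAgentSpec`, `build_expert_agents`, and `success_rate`, as well as the prompt wording and the example endpoints, are hypothetical and do not come from the paper. The sketch only builds configuration objects and performs no testing itself.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    """One reconnaissance observation (hypothetical structure)."""
    endpoint: str
    suspected_class: str  # illustrative label, e.g. "sql-injection"


@dataclass
class ExpertAgentSpec:
    """Configuration for an agent assembled during testing (hypothetical)."""
    name: str
    system_prompt: str
    targets: list[str] = field(default_factory=list)


def build_expert_agents(findings: list[Finding]) -> list[ExpertAgentSpec]:
    """Group findings by suspected class and emit one agent spec per class."""
    by_class: dict[str, list[str]] = {}
    for finding in findings:
        by_class.setdefault(finding.suspected_class, []).append(finding.endpoint)
    return [
        ExpertAgentSpec(
            name=f"{cls}-expert",
            system_prompt=f"You are a specialist in {cls}. Focus on: {', '.join(endpoints)}.",
            targets=endpoints,
        )
        for cls, endpoints in by_class.items()
    ]


def success_rate(solved: int, total: int) -> float:
    """Exploit success rate as a percentage."""
    return 100.0 * solved / total
```

For example, `success_rate(12, 40)` reproduces the 30.0% figure reported above, and three findings spanning two suspected classes yield two agent specs, each scoped to its own endpoints.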
