What Makes a Good LLM Agent for Real-world Penetration Testing?

LLM-based agents show promise for automating penetration testing, yet reported performance varies widely across systems and benchmarks. We analyze 28 LLM-based penetration testing systems and evaluate five representative implementations across three benchmarks of increasing complexity. Our analysis reveals two distinct failure modes: Type A failures stem from capability gaps (missing tools, inadequate prompts) that engineering readily addresses, while Type B failures persist regardless of tooling due to planning and state management limitations. We show that Type B failures share a root cause that is largely invariant to the underlying LLM: agents lack real-time task difficulty estimation. As a result, agents misallocate effort, over-commit to low-value branches, and exhaust context before completing attack chains. Based on this insight, we present Excalibur, a penetration testing agent that couples strong tooling with difficulty-aware planning. A Tool and Skill Layer eliminates Type A failures through typed interfaces and retrieval-augmented knowledge. A Task Difficulty Assessment (TDA) mechanism addresses Type B failures by estimating tractability through four measurable dimensions (horizon estimation, evidence confidence, context load, and historical success) and uses these estimates to guide exploration-exploitation decisions within an Evidence-Guided Attack Tree Search (EGATS) framework. Excalibur achieves up to 91% task completion on CTF benchmarks with frontier models (39 to 49% relative improvement over baselines) and compromises 4 of 5 hosts on the GOAD Active Directory environment versus 2 by prior systems. These results show that difficulty-aware planning yields consistent end-to-end gains across models and addresses a limitation that model scaling alone does not eliminate.
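To make the difficulty-aware planning idea concrete, the following minimal sketch shows one way the four TDA dimensions named above (horizon estimation, evidence confidence, context load, historical success) could be folded into a single tractability score that steers an exploration-exploitation choice among attack-tree branches. The abstract does not specify how the dimensions are combined, so everything here is an assumption made for illustration: the `BranchEstimate` fields, the equal-weight average, the `max_horizon` cap, and the UCB-style exploration bonus are hypothetical and should not be read as Excalibur's actual TDA or EGATS implementation.

```python
import math
from dataclasses import dataclass


@dataclass
class BranchEstimate:
    """Hypothetical per-branch estimates for the four TDA dimensions."""
    name: str
    remaining_steps: int        # horizon estimation: steps still needed
    evidence_confidence: float  # 0..1, how well evidence supports the branch
    context_load: float         # 0..1, fraction of context already consumed
    historical_success: float   # 0..1, past success rate on similar branches
    visits: int = 0             # how often the agent has already tried it


def tractability(b: BranchEstimate, max_horizon: int = 10) -> float:
    """Equal-weight average of the four dimensions (an assumed combination).

    Shorter horizons and lower context load raise the score; higher
    evidence confidence and historical success raise it as well.
    """
    horizon_term = 1.0 - min(b.remaining_steps, max_horizon) / max_horizon
    context_term = 1.0 - b.context_load
    return (horizon_term + b.evidence_confidence
            + context_term + b.historical_success) / 4.0


def select_branch(branches: list[BranchEstimate],
                  exploration: float = 0.5) -> BranchEstimate:
    """Pick the branch with the best tractability plus a UCB-style bonus.

    The bonus shrinks as a branch accumulates visits, so rarely tried
    branches are still explored occasionally.
    """
    total_visits = sum(b.visits for b in branches) + 1

    def score(b: BranchEstimate) -> float:
        bonus = exploration * math.sqrt(
            math.log(total_visits + 1) / (b.visits + 1))
        return tractability(b) + bonus

    return max(branches, key=score)


if __name__ == "__main__":
    candidates = [
        BranchEstimate("smb-relay", 2, 0.9, 0.2, 0.8),
        BranchEstimate("kernel-exploit", 9, 0.2, 0.7, 0.1),
    ]
    print(select_branch(candidates).name)
```

In this sketch a branch that is close to completion and well supported by evidence is preferred, while the visit-based bonus keeps the agent from over-committing to a single branch indefinitely. A real system would presumably calibrate the weights and the bonus empirically rather than use the fixed values assumed here.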
