AgentCyberRange: Benchmarking Frontier AI Systems in Realistic Cyber Ranges

Frontier AI systems are increasingly capable of performing cybersecurity tasks, including codebase inspection, vulnerability detection, and exploitation. However, evaluation of their offensive capabilities remains constrained by limited access to open, reproducible, multi-host cyber ranges. Existing public benchmarks capture isolated skills such as CTF solving, vulnerability reproduction, and exploit generation, but often abstract away realistic intrusion workflows: discovering exposed services, gaining a foothold, collecting internal information, and expanding compromise across hosts. Because frontier AI systems are rarely evaluated under realistic attack conditions, this gap makes it difficult to observe emerging risks early. We introduce AgentCyberRange, the first open, multi-range infrastructure for measuring autonomous cyber attack capability in realistic cyber ranges. It combines 110 vulnerabilities across 15 real web applications and 8 enterprise-like cyber ranges with 156 internal hosts, plus Cage, a toolchain for execution, orchestration, result collection, and verification. The benchmark covers two core stages: web exploitation, where agents explore exposed applications and validate vulnerabilities, and post-exploitation, where agents turn an initial foothold into broader internal compromise. We evaluate six frontier AI systems under matched prompts and budgets. GPT-5.5 with Codex performs best, solving 16.1% of web exploitation tasks and 31.7% of post-exploitation tasks; with more concrete hints, these rates increase to 33.0% and 46.3%, respectively. We also observe out-of-benchmark findings, including previously unknown vulnerabilities in popular projects and payload mutation that bypasses host defenses. These results show that open cyber-range evaluation is necessary for observing emerging offensive capabilities under realistic and reproducible conditions.
