Existing benchmarks for LLM-based offensive security agents use isolated, single-target setups with a known vulnerable service and fixed objective. They measure exploitation effectively, but miss how real Capture-the-Flag (CTF) participants triage unknown surfaces, prioritize targets, and allocate effort under uncertainty. Current evaluations therefore fail to assess strategic reasoning beyond exploitation alone. To address this, we introduce \textit{CTFExplorer}, a benchmark suite that shifts offensive security evaluation toward a multi-target setting, which tests how agents explore, prioritize, and chain attacks. CTFExplorer deploys 40 web-based vulnerable services within a single environment, where agents must autonomously discover, distinguish, and exploit targets without predefined guidance. We also present a reactive multi-agent setup as a reference agent framework and develop an agent-agnostic evaluation framework that records structured reasoning traces for fine-grained assessment. This enables behavioral evaluation beyond binary flag capture, such as how agents manage target selection, handle failed hypotheses, coordinate across multiple stages, and extract security intelligence.