Large Language Model (LLM) agents are increasingly proposed for autonomous cybersecurity tasks, but their capabilities in realistic offensive settings remain poorly understood. We present DeepRed, an open-source benchmark for evaluating LLM-based agents on realistic Capture The Flag (CTF) challenges in isolated virtualized environments. DeepRed places an agent in a Kali attacker environment with terminal tools and optional web search, connects it over a private network to a target challenge, and records full execution traces for analysis. To move beyond binary solved/unsolved outcomes, we introduce a partial-credit scoring method based on challenge-specific checkpoints derived from public writeups, together with an automated summarise-then-judge labelling pipeline that assigns checkpoint completion from logs. Using DeepRed, we benchmark ten commercially accessible LLMs on ten VM-based CTF challenges spanning different challenge categories. The results indicate that current agents remain limited: the best model achieves only 35% average checkpoint completion. Agents perform best on common challenge types and worst on tasks requiring non-standard discovery and longer-horizon adaptation.
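As an illustration of the partial-credit idea, the following minimal sketch computes a per-challenge checkpoint completion fraction and averages it across challenges. It assumes unweighted checkpoints and a plain mean over challenges; the abstract does not specify DeepRed's exact weighting or aggregation. The function names, challenge names, and checkpoint labels are hypothetical and are not taken from the DeepRed codebase.

```python
from statistics import mean


def checkpoint_completion(completed: set[str], checkpoints: list[str]) -> float:
    """Fraction of a challenge's checkpoints that the agent reached.

    Assumption: every checkpoint carries equal weight.
    """
    if not checkpoints:
        raise ValueError("a challenge needs at least one checkpoint")
    reached = sum(1 for name in checkpoints if name in completed)
    return reached / len(checkpoints)


def average_completion(runs: dict[str, tuple[set[str], list[str]]]) -> float:
    """Unweighted mean of per-challenge completion fractions for one model."""
    return mean([checkpoint_completion(done, cps) for done, cps in runs.values()])


# Hypothetical labelled output for one model on two challenges:
# (checkpoints judged complete, all checkpoints defined for the challenge).
runs = {
    "challenge_a": ({"recon", "foothold"}, ["recon", "foothold", "privesc", "flag"]),
    "challenge_b": ({"recon"}, ["recon", "exploit"]),
}

score = average_completion(runs)
print(f"average checkpoint completion: {score:.0%}")
```

Under this reading, a model that never captures a flag can still receive credit for intermediate progress, which is what separates the metric from a binary solved/unsolved outcome.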