Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps

We introduce the Cyber Defense Benchmark, which measures how well large language model (LLM) agents perform the core SOC analyst task of threat hunting: given a database of raw Windows event logs with no guided questions or hints, identify the exact timestamps of malicious events. The benchmark wraps 106 real attack procedures from the OTRF Security-Datasets corpus, spanning 86 MITRE ATT&CK sub-techniques across 12 tactics, into a Gymnasium reinforcement-learning environment. Each episode presents the agent with an in-memory SQLite database of 75,000-135,000 log records produced by a deterministic campaign simulator that time-shifts and entity-obfuscates the raw recordings. The agent must iteratively submit SQL queries to discover malicious event timestamps and explicitly flag them; flags are scored CTF-style against Sigma-rule-derived ground truth. Evaluating five frontier models (Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, Kimi K2.5, and Gemini 3 Flash) on 26 campaigns covering 105 of 106 procedures, we find that all models fail dramatically: the best model (Claude Opus 4.6) submits correct flags for only 3.8% of malicious events on average, and no run by any model finds all flags. We define a passing score as >= 50% recall on every ATT&CK tactic, the minimum bar for unsupervised SOC deployment. No model passes: the leader clears this bar on 5 of 13 tactics, and the remaining four models clear it on none. These results suggest that current LLMs are poorly suited for open-ended, evidence-driven threat hunting despite strong performance on curated Q&A security benchmarks.
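To make the task and the pass criterion concrete, the following is a minimal sketch, not the benchmark's actual code. It builds a tiny in-memory SQLite log table, runs one hunting query, and scores the flagged timestamps by per-tactic recall against the >= 50% bar. The table schema, the sample events, the tactic labels, and the helper names (`hunt`, `recall_by_tactic`, `passes`) are all hypothetical, and any penalty the benchmark may apply to incorrect flags is not modeled here.

```python
import sqlite3
from collections import defaultdict

# Hypothetical toy log table; the benchmark's real schema is not given in the abstract.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts TEXT, event_id INTEGER, command_line TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("2024-01-01T10:00:00", 4688, "svchost.exe -k netsvcs"),
        ("2024-01-01T10:00:05", 4688, "cmd.exe /c whoami"),
        ("2024-01-01T10:00:09", 4688, "rundll32.exe comsvcs.dll MiniDump 624 out.dmp full"),
        ("2024-01-01T10:00:12", 4688, "explorer.exe"),
    ],
)

# Hypothetical ground truth: malicious timestamp -> ATT&CK tactic label.
ground_truth = {
    "2024-01-01T10:00:05": "discovery",
    "2024-01-01T10:00:09": "credential-access",
}


def hunt(conn):
    """One illustrative query an agent might submit; returns the timestamps it flags."""
    query = "SELECT ts FROM events WHERE command_line LIKE '%comsvcs.dll%'"
    return {ts for (ts,) in conn.execute(query)}


def recall_by_tactic(flags, truth):
    """Fraction of ground-truth malicious events flagged, computed per tactic."""
    total = defaultdict(int)
    found = defaultdict(int)
    for ts, tactic in truth.items():
        total[tactic] += 1
        if ts in flags:
            found[tactic] += 1
    return {tactic: found[tactic] / total[tactic] for tactic in total}


def passes(recalls, threshold=0.5):
    """Pass criterion from the abstract: recall >= threshold on every tactic."""
    return all(r >= threshold for r in recalls.values())


flags = hunt(conn)
recalls = recall_by_tactic(flags, ground_truth)
overall_recall = len(flags & set(ground_truth)) / len(ground_truth)

print(recalls)
print(overall_recall, passes(recalls))
```

In this toy run the single query catches the credential-dumping command but misses the discovery command, so overall recall is 50% while the run still fails the criterion: one tactic has zero recall. This illustrates why a per-tactic bar is stricter than an average.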
