We introduce the Cyber Defense Benchmark, which measures how well large language model (LLM) agents perform the core SOC analyst task of threat hunting: given a database of raw Windows event logs and no guided questions or hints, identify the exact timestamps of malicious events. The benchmark wraps 106 real attack procedures from the OTRF Security-Datasets corpus, spanning 86 MITRE ATT&CK sub-techniques across 12 tactics, into a Gymnasium reinforcement-learning environment. Each episode presents the agent with an in-memory SQLite database of 75,000-135,000 log records produced by a deterministic campaign simulator that time-shifts and entity-obfuscates the raw recordings. The agent must iteratively submit SQL queries to discover malicious event timestamps and explicitly flag them; flags are scored CTF-style against Sigma-rule-derived ground truth. Evaluating five frontier models (Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, Kimi K2.5, and Gemini 3 Flash) on 26 campaigns covering 105 of the 106 procedures, we find that all models fail dramatically: the best model (Claude Opus 4.6) submits correct flags for only 3.8% of malicious events on average, and no run by any model finds all flags. We define a passing score as at least 50% recall on every ATT&CK tactic, which we take as the minimum bar for unsupervised SOC deployment. No model passes: the leader clears this bar on 5 of 13 tactics, and the remaining four models clear it on none. These results suggest that current LLMs are poorly suited for open-ended, evidence-driven threat hunting despite strong performance on curated Q&A security benchmarks.
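The passing criterion (at least 50% recall on every ATT&CK tactic) can be illustrated with a minimal Python sketch. This is not the benchmark's scoring code: the function names `per_tactic_recall` and `passes`, the timestamp strings, and the assumption that each ground-truth event maps to exactly one tactic are all hypothetical, and the sketch ignores any penalty for incorrect flags.

```python
from collections import defaultdict


def per_tactic_recall(ground_truth, flagged):
    """Compute recall per tactic.

    ground_truth: dict mapping a malicious event timestamp to its tactic
                  (assumption: one tactic per event).
    flagged:      set of timestamps the agent submitted as flags.
    """
    total = defaultdict(int)
    hits = defaultdict(int)
    for timestamp, tactic in ground_truth.items():
        total[tactic] += 1
        if timestamp in flagged:
            hits[tactic] += 1
    return {tactic: hits[tactic] / total[tactic] for tactic in total}


def passes(recalls, threshold=0.5):
    """A run passes only if every tactic reaches the recall threshold."""
    return all(recall >= threshold for recall in recalls.values())


# Hypothetical toy data: four malicious events across two tactics.
ground_truth = {
    "2024-01-01T10:00:00": "execution",
    "2024-01-01T10:05:00": "execution",
    "2024-01-01T10:07:00": "persistence",
    "2024-01-01T10:09:00": "persistence",
}
# The agent flags one true event and one benign timestamp.
flagged = {"2024-01-01T10:00:00", "2024-01-01T11:00:00"}

recalls = per_tactic_recall(ground_truth, flagged)
print(recalls)
print(passes(recalls))
```

In this toy run the agent reaches the threshold on one tactic but finds nothing for the other, so the run does not pass. Requiring the threshold on every tactic means strong recall on one tactic cannot offset a complete miss on another.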