Security incident analysis (SIA) poses a major challenge for security operations centers, which must manage overwhelming alert volumes, large and diverse data sources, complex toolchains, and limited analyst expertise. These difficulties intensify because incidents evolve dynamically and require multi-step, multifaceted reasoning. Although organizations are eager to adopt Large Language Models (LLMs) to support SIA, the absence of rigorous benchmarking creates significant risks for assessing their effectiveness and guiding design decisions. Benchmarking is further complicated by: (i) the lack of an LLM-ready dataset covering a wide spectrum of SIA tasks; (ii) the continual emergence of new tasks reflecting the diversity of analyst responsibilities; and (iii) the rapid release of new LLMs that must be incorporated into evaluations. In this paper, we address these challenges by introducing SIABENCH, an agentic evaluation framework for security incident analysis. First, we construct a first-of-its-kind dataset comprising two major SIA task categories: (i) deep analysis workflows for security incidents (25 scenarios) and (ii) alert-triage tasks (135 scenarios). Second, we implement an agent capable of autonomously performing a broad spectrum of SIA tasks (including network and memory forensics, malware analysis across binary/code/PDF formats, phishing email and kit analysis, log analysis, and false-alert detection). Third, we benchmark 11 major LLMs (spanning both open- and closed-weight models) on these tasks, with extensibility to support emerging models and newly added analysis scenarios.