This paper presents AuditBench, a new benchmark dataset for evaluating how well LLMs investigate security-related system audit logs. We design and use this benchmark to explore the performance of LLMs on four log-investigation tasks that incident response teams commonly perform, ranging from triaging alerts generated by detectors to identifying persistence mechanisms on compromised systems. AuditBench consists of system audit logs collected from Linux and Windows machines and spans over 50 security investigation scenarios, including both malicious and benign activity. Using our benchmark, we evaluate and analyze the performance of five frontier LLMs on audit-log analysis for attack investigations. Our analysis illuminates how LLM performance and error profiles vary with design choices such as model size, data representation, and prompt construction, as well as with the specific investigation task. Additionally, we characterize the quality of the explanations that LLMs produce and the types of errors that they make across our benchmark. Collectively, our work provides a foundation for assessing the capabilities of LLMs for investigating security logs, novel insights for practitioners using LLMs in security operations, and important directions for future research.