CAM-LDS: Cyber Attack Manifestations for Automatic Interpretation of System Logs and Security Alerts

Log data are essential for intrusion detection and forensic investigations. However, manual log analysis is tedious due to high data volumes, heterogeneous event formats, and unstructured messages. Although many automated methods for log analysis exist, they usually still rely on domain-specific configurations such as expert-defined detection rules, handcrafted log parsers, or manual feature engineering. Crucially, the level of automation that conventional methods achieve is limited because they cannot semantically understand logs or explain the underlying causes of the events they record. In contrast, large language models (LLMs) enable domain- and format-agnostic interpretation of system logs and security alerts. Unfortunately, research on this topic remains challenging because publicly available, labeled data sets covering a broad range of attack techniques are scarce. To address this gap, we introduce the Cyber Attack Manifestation Log Data Set (CAM-LDS), which comprises seven attack scenarios covering 81 distinct techniques across 13 tactics and was collected from 18 distinct sources within a fully open-source and reproducible test environment. We extract the log events that directly result from attack executions to facilitate analysis of attack manifestations with respect to command observability, event frequencies, performance metrics, and intrusion detection alerts. We further present an illustrative case study that uses an LLM to process the CAM-LDS. The results indicate that correct attack techniques are predicted perfectly for approximately one third of attack steps and adequately for another third, highlighting the potential of LLM-based log interpretation and the utility of our data set.
