Investigating cybersecurity incidents requires collecting and analyzing evidence from multiple log sources, including intrusion detection alerts, network traffic records, and authentication events. This process is labor-intensive: analysts must sift through large volumes of data to identify relevant indicators and piece together what happened. We present a RAG-based system that performs security incident analysis by combining targeted query-based filtering with LLM semantic reasoning. The system uses a library of queries, each associated with MITRE ATT&CK techniques, to extract indicators from raw logs, then retrieves relevant context to answer forensic questions and reconstruct attack sequences. We evaluate the system with eight LLM configurations on malware traffic incidents and a multi-stage Active Directory attack. The models trade off accuracy against cost: across 17 malware scenarios, Claude Sonnet 4 achieves 94% average recall and DeepSeek V3 achieves 89%, while DeepSeek costs 15$\times$ less than Claude per analysis, and a locally deployed Llama 3.1:70b achieves 81% recall at zero per-query cost. Attack step detection on the Active Directory scenario reaches 100% precision and up to 96% recall with an enumeration prompt. These results demonstrate that combining targeted query-based filtering with RAG-based retrieval, a combination that ablation studies confirm is essential, enables accurate, cost-effective security analysis within LLM context limits.
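As a rough illustration of the filter-then-retrieve idea described above, the following minimal Python sketch tags raw log lines using a small query library keyed to MITRE ATT&CK technique IDs, then selects the most relevant indicators for a forensic question. It is not the system's implementation: the `Query` class, the two regex patterns, the sample log format, and the token-overlap ranking (a stand-in for whatever retrieval method the system actually uses) are all assumptions made for this example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    """One entry of a hypothetical query library."""
    name: str
    technique: str  # MITRE ATT&CK technique ID
    pattern: str    # regex applied to each raw log line


# Hypothetical two-entry library; a real library would be far larger.
QUERY_LIBRARY = [
    Query("failed_logon", "T1110", r"EventID=4625"),              # Brute Force
    Query("powershell_encoded", "T1059.001", r"powershell.*-enc"),  # PowerShell
]


def extract_indicators(log_lines, library=QUERY_LIBRARY):
    """Keep only log lines matched by a query, tagged with its technique."""
    indicators = []
    for line in log_lines:
        for query in library:
            if re.search(query.pattern, line, flags=re.IGNORECASE):
                indicators.append(
                    {"query": query.name, "technique": query.technique, "line": line}
                )
    return indicators


def _tokens(text):
    """Lowercased word tokens used by the naive ranking below."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(indicators, question, k=3):
    """Rank indicators by token overlap with the question.

    This is a simple stand-in (an assumption) for embedding-based retrieval.
    """
    question_tokens = _tokens(question)

    def score(indicator):
        text = f"{indicator['query']} {indicator['technique']} {indicator['line']}"
        return len(question_tokens & _tokens(text))

    return sorted(indicators, key=score, reverse=True)[:k]


def build_prompt(question, context):
    """Assemble a compact prompt from the retrieved indicators only."""
    evidence = "\n".join(f"[{c['technique']}] {c['line']}" for c in context)
    return f"Evidence:\n{evidence}\n\nQuestion: {question}"


if __name__ == "__main__":
    # Made-up log lines in an invented key=value format.
    logs = [
        "2024-01-01T10:00:00 host=dc01 EventID=4625 user=alice src=10.0.0.5",
        "2024-01-01T10:00:05 host=ws02 EventID=4624 user=bob src=10.0.0.7",
        '2024-01-01T10:01:00 host=ws02 cmd="powershell.exe -enc SQBFAFgA"',
    ]
    question = "Which user had a failed logon?"
    hits = retrieve(extract_indicators(logs), question, k=1)
    print(build_prompt(question, hits))
```

Filtering before retrieval means only lines matched by some query ever reach the model, which is one plausible way to stay within context limits; the prompt text would then be passed to whichever LLM configuration is under evaluation.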