CyberSleuth: Autonomous Blue-Team LLM Agent for Web Attack Forensics

Post-mortem analysis of compromised systems is a key aspect of cyber forensics, yet today it remains a mostly manual, slow, and error-prone task. Agentic AI, i.e., LLM-powered agents, is a promising avenue for automation. However, applying such agents to cybersecurity remains largely unexplored and difficult, because the domain demands long-term reasoning, contextual memory, and consistent evidence correlation, capabilities that current LLM agents struggle to master. In this paper, we present the first systematic study of LLM agents for automating post-mortem investigation. As a first scenario, we consider realistic attacks in which remote attackers try to abuse online services using well-known CVEs (30 controlled cases). The agent receives the network traces of the attack as input and extracts forensic evidence from them. We compare three AI agent architectures and six LLM backends, assessing their ability to (i) identify compromised services, (ii) map exploits to exact CVEs, and (iii) prepare thorough reports. Our best-performing system, CyberSleuth, achieves 80% accuracy on incidents from 2025, producing complete, coherent, and practically useful reports (as judged by a panel of 25 experts). We then illustrate how readily CyberSleuth adapts to the analysis of traffic from infected machines, showing that an effective AI agent design can transfer across forensic tasks. Our findings show that (i) multi-agent specialisation is key to sustained reasoning; (ii) simple orchestration outperforms nested hierarchical architectures; and (iii) the CyberSleuth design generalises across different forensic tasks.
