Marlin: Knowledge-Driven Analysis of Provenance Graphs for Efficient and Robust Detection of Cyber Attacks

Recent research in both academia and industry has validated the effectiveness of provenance graph-based approaches for detecting and investigating advanced cyber attacks. However, analyzing large-scale provenance graphs often incurs substantial overhead. To improve performance, existing detection systems implement various optimization strategies. Yet, as several recent studies suggest, these strategies can lose necessary context information and be vulnerable to evasion. Designing a detection system that is both efficient and robust against adversarial attacks remains an open problem. We introduce Marlin, which approaches cyber attack detection through real-time provenance graph alignment. By leveraging query graphs embedded with attack knowledge, Marlin can efficiently identify entities and events within provenance graphs, enabling targeted analysis and significantly narrowing the search space. Furthermore, we incorporate our graph alignment algorithm into a tag propagation-based schema to eliminate the need for storing and reprocessing raw logs. This design significantly reduces in-memory storage requirements and minimizes data processing overhead. As a result, it enables real-time graph alignment while preserving essential context information, thereby enhancing the robustness of cyber attack detection. In addition, Marlin allows analysts to flexibly customize attack query graphs to detect extended attacks, and it provides interpretable detection results. We conduct experimental evaluations on two large-scale public datasets containing 257.42 GB of logs, using 12 query graphs of varying sizes that cover multiple attack techniques and scenarios. The results show that Marlin can process 137K events per second while accurately identifying 120 subgraphs with 31 confirmed attacks, along with only 1 false positive, demonstrating its efficiency and accuracy in handling massive data.
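
To make the idea of tag propagation-based graph alignment concrete, the following is a minimal illustrative sketch, not Marlin's actual algorithm. It assumes a query graph whose nodes are matched only by entity kind and whose edges are matched only by event type; the names `QueryEdge`, `TagAligner`, and `process` are hypothetical. Each entity carries a small tag recording which query edges have been matched on the information flow leading to it, so each event is consumed once and no raw logs are retained.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryEdge:
    src: str  # name of the source query node
    op: str   # event type that must be observed, e.g. "read"
    dst: str  # name of the destination query node


class TagAligner:
    """Toy streaming aligner (assumption: matching by entity kind and event type only)."""

    def __init__(self, node_kinds, edges):
        self.node_kinds = node_kinds  # query node name -> required entity kind
        self.edges = edges
        # Query nodes with no incoming query edge can start an alignment.
        self.roots = set(node_kinds) - {e.dst for e in edges}
        # entity id -> {query node name -> frozenset of matched query edge indices}
        self.tags = defaultdict(dict)

    def process(self, src_id, src_kind, op, dst_id, dst_kind):
        """Consume one event; return True if it completes the query graph."""
        complete = False
        for i, e in enumerate(self.edges):
            if e.op != op:
                continue
            if self.node_kinds[e.src] != src_kind or self.node_kinds[e.dst] != dst_kind:
                continue
            src_tags = self.tags.get(src_id, {})
            if e.src in src_tags:
                matched = src_tags[e.src]   # extend an existing partial alignment
            elif e.src in self.roots:
                matched = frozenset()       # start a new alignment at a root node
            else:
                continue                    # no context: the event is irrelevant
            # Propagate the tag to the destination entity, merging partial matches.
            dst_tags = self.tags[dst_id]
            dst_tags[e.dst] = dst_tags.get(e.dst, frozenset()) | matched | {i}
            if len(dst_tags[e.dst]) == len(self.edges):
                complete = True
        return complete


# Hypothetical query: a file is read by a process that then sends to a socket.
aligner = TagAligner(
    node_kinds={"f": "file", "p": "process", "s": "socket"},
    edges=[QueryEdge("f", "read", "p"), QueryEdge("p", "send", "s")],
)
events = [
    ("/etc/passwd", "file", "read", "pid42", "process"),
    ("pid7", "process", "send", "1.2.3.4:80", "socket"),   # unrelated process
    ("pid42", "process", "send", "1.2.3.4:80", "socket"),  # completes the query
]
results = [aligner.process(*ev) for ev in events]
print(results)  # [False, False, True]
```

In this sketch, the second event is ignored because `pid7` carries no tag, which illustrates how query knowledge can narrow the search space. A realistic system would also need richer node predicates, propagation through intermediate entities, and a policy for expiring stale tags.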
