This paper introduces a new structural causal model tailored to threshold-based IT systems and presents a new algorithm designed to rapidly detect the root causes of anomalies in such systems. When root causes are not causally related, the method is proven to be correct; to relax this assumption, we propose an extension that relies on interventions by an agent. Both the algorithm and its agent-based extension perform causal discovery on offline data and traverse a subgraph when new anomalies appear in online data. Extensive experiments demonstrate the superior performance of our methods, even on data generated from alternative structural causal models and on real IT monitoring data.
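As a rough illustration of the idea of traversing the anomalous subgraph of a causal graph, the following minimal sketch flags a metric as anomalous when it exceeds its threshold and reports as root causes the anomalous nodes that have no anomalous parent. It is not the paper's algorithm: the function `find_root_causes`, the metric names, the values, and the thresholds are all hypothetical, and the causal graph is assumed to be an acyclic graph already obtained from offline data. The sketch is only meaningful under the stated assumption that root causes are not causally related to one another.

```python
from typing import Dict, List, Set


def find_root_causes(
    parents: Dict[str, List[str]],
    values: Dict[str, float],
    thresholds: Dict[str, float],
) -> Set[str]:
    """Return anomalous nodes that have no anomalous parent.

    Hypothetical helper for illustration only; assumes an acyclic
    causal graph given as a mapping from each node to its parents.
    """
    # A node is anomalous when its observed value exceeds its threshold.
    anomalous = {n for n, v in values.items() if v > thresholds[n]}
    # Within the subgraph induced by anomalous nodes, keep the nodes
    # whose parents are all normal: nothing upstream explains them.
    return {
        n
        for n in anomalous
        if not any(p in anomalous for p in parents.get(n, []))
    }


# Hypothetical causal graph: db_latency -> api_latency -> error_rate; cpu is isolated.
parents = {
    "db_latency": [],
    "api_latency": ["db_latency"],
    "error_rate": ["api_latency"],
    "cpu": [],
}
# Hypothetical online observations and per-metric thresholds.
values = {"db_latency": 950.0, "api_latency": 1200.0, "error_rate": 0.2, "cpu": 0.35}
thresholds = {"db_latency": 500.0, "api_latency": 800.0, "error_rate": 0.05, "cpu": 0.9}

roots = find_root_causes(parents, values, thresholds)
print(sorted(roots))  # ['db_latency']
```

In this toy example three metrics exceed their thresholds, but only `db_latency` has no anomalous parent, so it is the single reported root cause. If a second root cause sat downstream of `db_latency`, this simple rule would miss it, which is the situation the agent-based extension is meant to address.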