This paper introduces a new structural causal model tailored to threshold-based IT systems and presents a new algorithm for rapidly detecting the root causes of anomalies in such systems. The method is proven correct when root causes are not causally related, and an extension based on agent intervention is proposed to relax this assumption. Both the algorithm and its agent-based extension perform causal discovery on offline data and traverse subgraphs when new anomalies appear in online data. Extensive experiments demonstrate the superior performance of our methods, even on data generated from alternative structural causal models and on real IT monitoring data.
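
To make the idea of subgraph traversal concrete, the following is a minimal sketch and not the algorithm of this paper. It assumes that the causal graph is already known and acyclic, that a metric is anomalous when it exceeds a fixed threshold, and that a root cause is an anomalous node with no anomalous parent. The function names `detect_anomalous` and `find_root_causes`, the metric names, and all values are hypothetical.

```python
from typing import Dict, List, Set


def detect_anomalous(values: Dict[str, float], thresholds: Dict[str, float]) -> Set[str]:
    """Return the metrics whose current value exceeds their threshold."""
    return {name for name, value in values.items() if value > thresholds[name]}


def find_root_causes(parents: Dict[str, List[str]], anomalous: Set[str]) -> Set[str]:
    """Return anomalous nodes that have no anomalous parent.

    Assumption (not taken from the paper): the graph is acyclic and is
    given as a mapping from each node to the list of its parents.
    """
    roots = set()
    for node in anomalous:
        # Restrict attention to the subgraph induced by anomalous nodes.
        anomalous_parents = [p for p in parents.get(node, []) if p in anomalous]
        if not anomalous_parents:
            roots.add(node)
    return roots


# Hypothetical causal graph: db -> api -> frontend, cache -> api.
parents = {"db": [], "cache": [], "api": ["db", "cache"], "frontend": ["api"]}
thresholds = {"db": 0.8, "cache": 0.5, "api": 1000.0, "frontend": 2.0}
values = {"db": 0.95, "cache": 0.1, "api": 1200.0, "frontend": 3.0}

anomalous = detect_anomalous(values, thresholds)
root_causes = find_root_causes(parents, anomalous)
print(sorted(anomalous), sorted(root_causes))
```

In this sketch, an anomalous node that lies downstream of another anomalous node is never reported, which illustrates why a guarantee of this kind depends on root causes not being causally related to one another.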