Extracting anomaly causality facilitates diagnostics once monitoring systems detect system faults. Identifying anomaly causes in large systems requires investigating a broad set of monitoring variables across multiple subsystems. However, learning graphical causal models (GCMs) carries a significant computational burden that restricts the applicability of most existing methods in real-time and large-scale deployments. In addition, modern monitoring applications for large systems often generate large amounts of binary alarm flags, and the distinct characteristics of binary anomaly data, namely the meaning of state transitions and data sparsity, challenge existing causality learning mechanisms. This study proposes an anomaly causal discovery approach (AnomalyCD) that addresses the accuracy and computational challenges of generating GCMs from temporal binary flag datasets. AnomalyCD combines several strategies, including anomaly data-aware causality testing, sparse data and prior link compression, and edge pruning adjustment. We validate the approach on two datasets: monitoring sensor data from the readout-box system of the Compact Muon Solenoid experiment at CERN, and a public dataset from an information technology monitoring system. The results on temporal GCMs demonstrate a considerable reduction in computation overhead and a moderate improvement in accuracy on the binary anomaly datasets. Code: https://github.com/muleina/AnomalyCD