Failures in complex systems demand rapid Root Cause Analysis (RCA) to prevent cascading damage. Existing RCA methods that operate without a dependency graph typically assume that the root cause has the highest anomaly score. This assumption fails when faults propagate, because a small delay at the root cause can accumulate into a much larger anomaly downstream. In this paper, we propose PRISM, a simple and efficient framework for RCA when the dependency graph is absent. We formalize a class of component-based systems for which PRISM performs RCA with theoretical guarantees. On 735 failures across 9 real-world datasets, PRISM achieves 68% Top-1 accuracy, a 258% improvement over the best baseline, while requiring only 8 ms per diagnosis.