Localizing root causes in multi-dimensional data is critical to ensuring the reliability of online service systems. When a fault occurs, only the measure values within specific attribute combinations are abnormal. Such attribute combinations are important clues to the underlying root causes and are therefore called the root causes of multi-dimensional data. This paper proposes PSqueeze, a generic and robust root cause localization approach for multi-dimensional data. We propose a generic property of root causes in multi-dimensional data, the generalized ripple effect (GRE). Based on it, we propose a novel probabilistic clustering method and a robust heuristic search method. Moreover, for the first time in the literature, we identify the importance of determining external root causes and propose an effective method for it. Our experiments on two real-world datasets with 5400 faults show that PSqueeze outperforms the baselines by 32.89% in F1-score, while the localization time is around 10 seconds across all cases. In determining external root causes, PSqueeze achieves an F1-score of 0.90. Furthermore, case studies in several production systems demonstrate that PSqueeze is helpful for fault diagnosis in the real world.