While cloud-native microservice architectures have revolutionized software development, their inherent operational complexity makes failure Root Cause Analysis (RCA) a critical yet challenging task. Numerous data-driven RCA models have been proposed to address this challenge. However, we find that the benchmarks used to evaluate these models are often too simple to reflect real-world scenarios. Our preliminary study reveals that simple rule-based methods can match or even surpass state-of-the-art (SOTA) models on four widely used public benchmarks. This finding suggests that the oversimplification of existing benchmarks may lead to an overestimation of the performance of RCA methods. To investigate this issue further, we conduct a systematic analysis of popular public RCA benchmarks and identify key limitations in their fault injection strategies, call graph structures, and telemetry signal patterns. Based on these insights, we propose an automated framework for generating more challenging and comprehensive benchmarks that include complex fault propagation scenarios. Our new dataset contains 1,430 validated failure cases drawn from 9,152 fault injections; it covers 25 fault types across 6 categories, uses dynamic workloads, and provides hierarchical ground-truth labels that map failures from services down to code-level causes. Crucially, to ensure the failure cases are relevant to IT operations, each case is validated to have a discernible impact on user-facing service level indicators (SLIs). Our re-evaluation of 11 SOTA models on this new benchmark shows that they achieve low Top@1 accuracy, averaging 0.21, with the best-performing model reaching merely 0.37, while execution times escalate from seconds to hours.
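For readers unfamiliar with the Top@1 metric reported above, the following is a minimal sketch of how Top@k accuracy is commonly computed for RCA: a case counts as a hit when the true root cause appears among the first k candidates the model ranks. The function name `top_k_accuracy`, the service names, and the toy rankings are illustrative assumptions and are not taken from the paper's evaluation code or data.

```python
from typing import Sequence


def top_k_accuracy(
    rankings: Sequence[Sequence[str]],
    truths: Sequence[str],
    k: int = 1,
) -> float:
    """Fraction of cases whose true root cause is among the first k ranked candidates."""
    if len(rankings) != len(truths):
        raise ValueError("rankings and truths must have the same length")
    if k < 1:
        raise ValueError("k must be at least 1")
    if not truths:
        return 0.0
    # A case is a hit when its ground-truth label appears in the top-k slice.
    hits = sum(truth in ranked[:k] for ranked, truth in zip(rankings, truths))
    return hits / len(truths)


# Hypothetical toy data: one ranked candidate list per failure case.
rankings = [
    ["cart", "checkout", "payment"],
    ["payment", "cart", "checkout"],
    ["checkout", "payment", "cart"],
    ["cart", "payment", "checkout"],
]
# Hypothetical ground-truth root-cause service for each case.
truths = ["cart", "cart", "cart", "payment"]

top1 = top_k_accuracy(rankings, truths, k=1)
top2 = top_k_accuracy(rankings, truths, k=2)
top3 = top_k_accuracy(rankings, truths, k=3)
```

Under this definition, an average Top@1 of 0.21 would mean that a model's first-ranked candidate is the true root cause in roughly one case out of five.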