Microservice root cause localization (RCL) is fundamentally challenged by the inherent heterogeneity of cloud-native systems, which spans diverse observability data and multiple types of system entities. Existing approaches typically focus on only one aspect of this heterogeneity and thus fail to capture its full diagnostic value. In this work, we systematically examine the multifaceted role of heterogeneity within both microservice systems and the RCL process. This analysis motivates a deeper investigation into how entity-level distinctions and their asymmetric dependencies influence fault behavior. Our empirical analysis of two microservice benchmarks reveals that entity-level heterogeneity naturally gives rise to heterogeneous fault propagation, which is highly asymmetric and dominated by cross-layer interactions between services and hosts. In light of this, we propose NexusRCL, a semi-supervised framework that internalizes these propagation patterns by formalizing services and hosts as distinct node types within a heterogeneous graph. This design, coupled with an event-based abstraction mechanism, allows NexusRCL to capture both data-level and entity-level heterogeneity while minimizing labeling costs through active learning. Comprehensive evaluations on two industrial benchmark datasets demonstrate NexusRCL's superior performance, achieving improvements of up to 49.85% in Top-1 accuracy (A@1) and 32.70% in Average Top-5 accuracy (A@5) compared to state-of-the-art methods.