Cloud-native microservices enable rapid iteration and scalable deployment but also create complex, fast-evolving dependencies that challenge reliable diagnosis. Existing root cause analysis (RCA) approaches, even with multi-modal fusion of logs, traces, and metrics, remain limited in capturing dynamic behaviors and shifting service relationships. Three critical challenges persist: (i) inadequate modeling of cascading fault propagation, (ii) vulnerability to noise interference and concept drift in normal service behavior, and (iii) over-reliance on service deviation intensity, which obscures true root causes. To address these challenges, we propose DynaCausal, a dynamic causality-aware framework for RCA in distributed microservice systems. DynaCausal unifies multi-modal dynamic signals to capture time-varying spatio-temporal dependencies through interaction-aware representation learning. It further introduces a dynamic contrastive mechanism to disentangle true fault indicators from contextual noise and adopts a causal-prioritized pairwise ranking objective to explicitly optimize causal attribution. Comprehensive evaluations on public benchmarks demonstrate that DynaCausal consistently surpasses state-of-the-art methods, attaining an average AC@1 of 0.63, an absolute gain of 0.25 to 0.46 over those methods, and delivering both accurate and interpretable diagnoses in highly dynamic microservice environments.
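For readers unfamiliar with the reported metric, the sketch below shows how AC@k is commonly computed in the RCA literature: the fraction of fault cases in which the true root cause appears among the top-k ranked candidate services. It is an illustration under stated assumptions, not the paper's evaluation code. The function name `ac_at_k`, the service names, and the toy rankings are hypothetical, and each fault case is assumed to have exactly one labeled root cause.

```python
from typing import Sequence


def ac_at_k(
    rankings: Sequence[Sequence[str]],
    root_causes: Sequence[str],
    k: int = 1,
) -> float:
    """Fraction of fault cases whose true root cause is in the top-k ranking.

    Assumption: one labeled root-cause service per fault case.
    """
    if len(rankings) != len(root_causes):
        raise ValueError("rankings and root_causes must have the same length")
    if not rankings:
        raise ValueError("at least one fault case is required")
    hits = sum(
        1
        for ranked, truth in zip(rankings, root_causes)
        if truth in ranked[:k]
    )
    return hits / len(rankings)


# Hypothetical toy data: four fault cases, each with services ranked from
# most to least suspicious by some RCA method.
rankings = [
    ["cart", "checkout", "payment"],
    ["payment", "cart", "checkout"],
    ["checkout", "payment", "cart"],
    ["cart", "payment", "checkout"],
]
root_causes = ["cart", "cart", "checkout", "payment"]

ac1 = ac_at_k(rankings, root_causes, k=1)  # 2 of 4 cases hit at rank 1
ac2 = ac_at_k(rankings, root_causes, k=2)  # all 4 cases hit within rank 2
print(f"AC@1 = {ac1:.2f}, AC@2 = {ac2:.2f}")
```

Under this definition, an absolute gain in AC@1 is simply the difference between two methods' AC@1 scores on the same set of fault cases.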