In recent years, the widespread industry adoption of distributed microservice architectures has significantly increased the demand for system availability and robustness. Because service invocation paths and dependencies in enterprise-level microservice systems are complex, it is challenging to locate anomalies promptly during service invocations, which complicates normal system operation and maintenance. In this paper, we propose a Causal Heterogeneous grAph baSed framEwork for root cause analysis, namely CHASE, for microservice systems with multimodal data, including traces, logs, and system monitoring metrics. Specifically, the relevant information is encoded into representative embeddings and then modeled as a multimodal invocation graph. Anomaly detection is then performed on each instance node using attentive heterogeneous message passing from its adjacent metric and log nodes. Finally, CHASE learns from a constructed hypergraph, whose hyperedges represent the flow of causality, and performs root cause localization. We evaluate the proposed framework on two public microservice datasets with distinct attributes and compare it with state-of-the-art methods. The results show that CHASE achieves average performance gains of up to 36.2% (A@1) and 29.4% (Percentage@1), respectively, over its best counterpart.
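To make the reported ranking metric concrete, the following is a minimal sketch of a top-k accuracy computation. It assumes that A@k denotes the fraction of fault cases whose true root cause appears among the top-k ranked candidates, which is a common convention in root cause analysis but is not defined in the abstract itself. The function name `accuracy_at_k` and the service names are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def accuracy_at_k(
    rankings: Sequence[Sequence[str]],
    truths: Sequence[str],
    k: int = 1,
) -> float:
    """Fraction of fault cases whose true root cause is in the top-k candidates.

    Assumed definition of A@k; the evaluated paper may define it differently.
    """
    if len(rankings) != len(truths):
        raise ValueError("rankings and truths must have the same length")
    if not rankings:
        return 0.0
    # A case counts as a hit when the labeled root cause is ranked within the top k.
    hits = sum(1 for ranked, truth in zip(rankings, truths) if truth in ranked[:k])
    return hits / len(rankings)


# Hypothetical data: three fault cases, each with candidate services ranked
# from most to least suspicious.
rankings = [
    ["cart", "payment", "db"],
    ["db", "cart", "payment"],
    ["payment", "db", "cart"],
]
truths = ["cart", "cart", "payment"]

a1 = accuracy_at_k(rankings, truths, k=1)
a2 = accuracy_at_k(rankings, truths, k=2)
```

Under this assumed definition, a higher A@1 means the top-ranked candidate is more often the true root cause, which is the setting in which the first-rank gains above would be measured.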