Root Cause Analysis In Microservice Using Neural Granger Causal Discovery

In recent years, microservices have gained widespread adoption in IT operations due to their scalability, maintainability, and flexibility. However, when a system malfunctions, the complex relationships among microservices make it challenging for site reliability engineers (SREs) to pinpoint the root cause. Previous research employed structure learning methods (e.g., the PC algorithm) to establish causal relationships and derive root causes from causal graphs. Nevertheless, these methods ignore the temporal order of time series data and fail to leverage the rich information inherent in temporal relationships. For instance, a sudden spike in the CPU utilization of one microservice can increase the latency of other microservices. In this scenario, the CPU utilization anomaly occurs before the latency increase rather than simultaneously, so the PC algorithm fails to capture the relationship. To address these challenges, we propose RUN, a novel approach for root cause analysis using neural Granger causal discovery with contrastive learning. RUN enhances the backbone encoder by integrating contextual information from time series, and leverages a time series forecasting model to conduct neural Granger causal discovery. In addition, RUN incorporates PageRank with a personalization vector to efficiently recommend the top-k root causes. Extensive experiments conducted on synthetic and real-world microservice-based datasets demonstrate that RUN noticeably outperforms state-of-the-art root cause analysis methods. Moreover, we provide an analysis scenario for the sock-shop case to showcase the practicality and efficacy of RUN in microservice-based applications. Our code is publicly available at https://github.com/zmlin1998/RUN.
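
To make the two-stage idea in the abstract concrete (Granger-style causal discovery followed by personalized PageRank), the following is a minimal sketch and not the RUN implementation. It assumes a plain linear least-squares Granger test in place of the neural forecasting model and contrastive encoder, three synthetic metrics (`cpu`, `lat_a`, `lat_b`) with a hand-built lagged dependency, a hypothetical personalization vector, and a random walk along reversed edges (effect to cause) so that score accumulates at upstream metrics. All function names, thresholds, and data here are illustrative assumptions.

```python
import numpy as np


def rss(X, y):
    """Residual sum of squares of a least-squares fit of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)


def granger_graph(data, lag=2, threshold=0.1):
    """Linear stand-in for Granger discovery (assumption, not RUN's model).

    adj[i, j] = 1 if dropping the lags of series i raises the error of
    predicting series j by more than `threshold` (relative RSS gain).
    """
    T, n = data.shape
    # Column block k holds every series at time t-k, for rows t = lag..T-1.
    lags = np.hstack([data[lag - k:T - k] for k in range(1, lag + 1)])
    targets = data[lag:]
    ones = np.ones((T - lag, 1))  # intercept column
    adj = np.zeros((n, n), dtype=int)
    for j in range(n):
        full = rss(np.hstack([lags, ones]), targets[:, j])
        for i in range(n):
            if i == j:
                continue
            keep = np.arange(n * lag) % n != i  # drop all lags of series i
            restricted = rss(np.hstack([lags[:, keep], ones]), targets[:, j])
            if (restricted - full) / restricted > threshold:
                adj[i, j] = 1
    return adj


def personalized_pagerank(adj, personalization, damping=0.85, iters=100):
    """Power iteration for PageRank with a personalization (restart) vector."""
    p = np.asarray(personalization, dtype=float)
    p = p / p.sum()
    walk = adj.T.astype(float)  # step from an effect back to its causes
    out = walk.sum(axis=1, keepdims=True)
    # Nodes with no known cause restart according to p.
    trans = np.where(out > 0, walk / np.where(out == 0, 1.0, out), p)
    score = p.copy()
    for _ in range(iters):
        score = damping * (score @ trans) + (1 - damping) * p
    return score


# Synthetic example: a CPU spike precedes latency increases by one step.
rng = np.random.default_rng(0)
T = 500
cpu = rng.normal(size=T)
lat_a = np.zeros(T)
lat_b = np.zeros(T)
for t in range(1, T):
    lat_a[t] = 0.8 * cpu[t - 1] + 0.1 * rng.normal()
    lat_b[t] = 0.8 * lat_a[t - 1] + 0.1 * rng.normal()

names = ["cpu", "lat_a", "lat_b"]
data = np.column_stack([cpu, lat_a, lat_b])

adj = granger_graph(data)
# Hypothetical anomaly weights: more restart mass on the observed symptoms.
scores = personalized_pagerank(adj, personalization=[0.2, 0.3, 0.5])
ranking = [names[i] for i in np.argsort(-scores)]
print(adj)
print(ranking)
```

Because the test compares forecasts built from past values only, it can pick up the lagged `cpu` to `lat_a` dependency that a purely contemporaneous conditional-independence test would not be designed to exploit. The top-k recommendation is then simply the first k entries of `ranking`.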
