Root Cause Localization for Microservice Systems in Cloud-edge Collaborative Environments

As cloud-native technologies mature, accurately localizing the root cause of a failure in a microservice-based software system remains difficult. Cloud-edge collaborative environments add further obstacles, such as unstable networks and high latency across network segments, which makes accurate root cause localization for microservice systems in these environments an urgent problem. In this paper, we propose MicroCERCL, a novel approach that pinpoints root causes at both the kernel and application levels in a cloud-edge collaborative environment. Our key insight is that, in an environment characterized by instability and high latency, failures propagate through both direct invocations and indirect resource-competition dependencies. This propagation becomes even more complex in hybrid deployments, where multiple microservice systems run simultaneously. Leveraging this insight, we extract valid content from kernel-level logs and prioritize localizing kernel-level root causes. In addition, we construct a heterogeneous dynamic topology stack and train a graph neural network model to accurately localize application-level root causes without relying on historical data. Notably, we released the first benchmark hybrid-deployment microservice system in a cloud-edge collaborative environment, which is, to our knowledge, the largest and most complex of its kind. Experiments on the dataset collected from this benchmark show that MicroCERCL accurately localizes the root causes of microservice system failures in such environments and significantly outperforms state-of-the-art approaches, with an increase of at least 24.1% in top-1 accuracy.
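To make the two propagation paths and the evaluation metric concrete, the following is a minimal illustrative sketch, not MicroCERCL's implementation. It assumes that co-location on the same host is a reasonable proxy for resource competition, and it uses the common definition of top-k accuracy (the fraction of failure cases whose true root cause appears among the k highest-ranked candidates). The function names, edge labels, service names, and host names are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_dependency_edges(invocations, placement):
    """Return a set of typed edges (edge_type, source, target).

    invocations: iterable of (caller, callee) pairs (direct dependencies).
    placement:   dict mapping service -> host it runs on.
    Assumption: services sharing a host may compete for its resources,
    so each co-located pair gets an indirect "competes_with" edge.
    """
    edges = {("invokes", caller, callee) for caller, callee in invocations}

    services_by_host = defaultdict(list)
    for service, host in placement.items():
        services_by_host[host].append(service)

    for services in services_by_host.values():
        # Sort so each undirected pair is stored in one canonical order.
        for a, b in combinations(sorted(services), 2):
            edges.add(("competes_with", a, b))
    return edges


def top_k_accuracy(rankings, true_root_causes, k=1):
    """Fraction of cases whose true root cause is in the top-k ranked candidates."""
    hits = sum(
        1 for ranked, truth in zip(rankings, true_root_causes) if truth in ranked[:k]
    )
    return hits / len(true_root_causes)


# Hypothetical hybrid deployment: two systems ("shop", "iot") share an edge host.
edges = build_dependency_edges(
    invocations=[("shop/cart", "shop/db")],
    placement={"shop/cart": "edge-1", "iot/ingest": "edge-1", "shop/db": "cloud-1"},
)
```

In this toy deployment, `shop/cart` and `iot/ingest` belong to different systems and never call each other, yet they end up connected by a `competes_with` edge because they share `edge-1`. That is the kind of indirect dependency a purely invocation-based call graph would miss.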
