Edge computing environments host increasingly complex microservice-based IoT applications that are prone to performance anomalies propagating across dependent services. Identifying the faulty component (root cause localization, RCL) and the underlying fault type (root cause analysis, RCA) is essential for timely mitigation. Supervised graph neural networks (GNNs) currently represent the state of the art for joint RCL and RCA. However, existing approaches rely on centralized processing over full-system graphs, which leads to high inference latency and limited scalability in large, distributed edge environments. In this paper, we propose a cascaded GNN framework for joint RCL and fault type identification that explicitly addresses these scalability challenges. Our approach combines communication-driven clustering, which partitions large service graphs into highly interacting communities, with a cascaded network of two subnetworks that perform RCL and RCA hierarchically. By restricting message passing to reduced and structured subgraphs, the proposed framework significantly lowers computational complexity while preserving critical dependency information. We evaluate the proposed method on the MicroCERCL benchmark and on large-scale datasets generated using the iAnomaly simulation framework. Experimental results show that the cascaded architecture achieves diagnostic accuracy comparable to centralized GNN baselines while maintaining near-constant inference latency as graph size increases, enabling scalable and actionable AIOps in edge computing environments.
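To make the two-stage idea concrete, the following is a minimal sketch of clustering a service graph by communication volume and then localizing hierarchically. It is not the method proposed in the paper: the threshold-based union-find grouping stands in for the communication-driven clustering, and the per-service anomaly scores stand in for the outputs of the two GNN subnetworks. The function names, the `min_calls` threshold, and the example services are all hypothetical.

```python
from collections import defaultdict


def cluster_by_traffic(nodes, edges, min_calls):
    """Group services connected by edges carrying at least `min_calls` calls.

    Assumption: a simple union-find over heavy edges, used here only as a
    stand-in for a real community detection step.
    """
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for (src, dst), calls in edges.items():
        if calls >= min_calls:
            parent[find(src)] = find(dst)

    clusters = defaultdict(set)
    for n in nodes:
        clusters[find(n)].add(n)
    # Sort for a deterministic result.
    return sorted(sorted(members) for members in clusters.values())


def cascaded_localize(clusters, anomaly_score):
    """Stage 1 picks the cluster with the highest mean score;
    stage 2 picks the highest-scoring service inside that cluster.

    Assumption: `anomaly_score` is a placeholder for learned model outputs.
    """
    def mean_score(cluster):
        return sum(anomaly_score[n] for n in cluster) / len(cluster)

    suspect_cluster = max(clusters, key=mean_score)
    suspect_node = max(suspect_cluster, key=lambda n: anomaly_score[n])
    return suspect_cluster, suspect_node


# Hypothetical service graph: edge weights are call counts per interval.
nodes = ["gateway", "auth", "orders", "payments", "sensor-ingest", "sensor-store"]
edges = {
    ("gateway", "auth"): 120,
    ("gateway", "orders"): 90,
    ("orders", "payments"): 80,
    ("gateway", "sensor-ingest"): 5,
    ("sensor-ingest", "sensor-store"): 200,
}
scores = {
    "gateway": 0.1, "auth": 0.2, "orders": 0.3,
    "payments": 0.2, "sensor-ingest": 0.4, "sensor-store": 0.9,
}

clusters = cluster_by_traffic(nodes, edges, min_calls=50)
print(cascaded_localize(clusters, scores))
```

In this sketch the second stage inspects only the services of the selected cluster, so its cost depends on cluster size rather than on the size of the full graph, which is the intuition behind restricting message passing to subgraphs.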