Root cause localization remains challenging in complex, large-scale microservice architectures. Intricate fault propagation among microservices and the high dimensionality of telemetry data (metrics, logs, and traces) limit the effectiveness of existing root cause analysis (RCA) methods. This paper proposes RC-LLM, a residual-connection-based RCA method built on a large language model (LLM). A residual-like hierarchical fusion structure integrates multi-source telemetry data, while the contextual reasoning capability of the LLM is used to model temporal and cross-microservice causal dependencies. Experimental results on CCF-AIOps microservice datasets show that RC-LLM achieves strong accuracy and efficiency in root cause analysis.