Applications of large language models (LLMs) to cloud root cause analysis (RCA) have been actively explored recently. However, current methods still rely on manually configured workflows and do not fully exploit LLMs' decision-making and environment-interaction capabilities. We present RCAgent, a tool-augmented LLM autonomous agent framework for practical and privacy-aware industrial RCA. Running on an internally deployed model rather than the GPT family, RCAgent is capable of free-form data collection and comprehensive analysis with tools. Our framework combines a variety of enhancements, including a unique Self-Consistency for action trajectories and a suite of methods for context management, stabilization, and importing domain knowledge. Our experiments show that RCAgent clearly and consistently outperforms ReAct across all aspects of RCA (predicting root causes, solutions, evidence, and responsibilities) and on tasks both covered and not covered by current rules, as validated by both automated metrics and human evaluations. Furthermore, RCAgent has already been integrated into the diagnosis and issue discovery workflow of the Real-time Compute Platform for Apache Flink of Alibaba Cloud.