Large language models (LLMs) have recently been actively explored for root cause analysis (RCA) in cloud systems. However, current methods still rely on manually configured workflows and do not fully exploit LLMs' decision-making and environment-interaction capabilities. We present RCAgent, a tool-augmented LLM autonomous agent framework for practical, privacy-aware industrial RCA. Running on an internally deployed model rather than the GPT family, RCAgent is capable of free-form data collection and comprehensive analysis with tools. Our framework combines a variety of enhancements, including a unique form of Self-Consistency for action trajectories and a suite of methods for context management, stabilization, and importing domain knowledge. Our experiments show RCAgent's evident and consistent superiority over ReAct across all aspects of RCA (predicting root causes, solutions, evidence, and responsibilities) and on tasks both covered and not covered by current rules, as validated by both automated metrics and human evaluations. Furthermore, RCAgent has already been integrated into the diagnosis and issue discovery workflow of the Real-time Compute Platform for Apache Flink of Alibaba Cloud.