Failures in large-scale cloud systems incur substantial financial losses, making automated Root Cause Analysis (RCA) essential for operational stability. Recent efforts leverage Large Language Model (LLM) agents to automate this task, yet existing systems exhibit low detection accuracy even with capable models, and current evaluation frameworks assess only final-answer correctness without revealing why the agent's reasoning failed. This paper presents a process-level failure analysis of LLM-based RCA agents. We execute the full OpenRCA benchmark across five LLMs, producing 1,675 agent runs, and classify observed failures into 12 pitfall types spanning intra-agent reasoning, inter-agent communication, and agent-environment interaction. Our analysis reveals that the most prevalent pitfalls, notably hallucinated data interpretation and incomplete exploration, persist across all models regardless of capability tier, indicating that these failures originate from the shared agent architecture rather than from individual model limitations. Controlled mitigation experiments further show that prompt engineering alone cannot resolve the dominant pitfalls, whereas enriching the inter-agent communication protocol reduces communication-related failures by up to 15 percentage points. The pitfall taxonomy and diagnostic methodology developed in this work provide a foundation for designing more reliable autonomous agents for cloud RCA.