Root cause analysis (RCA) is essential for diagnosing failures in complex software systems and ensuring system reliability. The highly distributed and interdependent nature of modern cloud-based systems often complicates RCA, particularly under multi-hop fault propagation, where symptoms appear far from their true causes. Recent advances in Large Language Models (LLMs) present new opportunities to enhance automated RCA. However, their practical value for RCA depends on the fidelity of their reasoning and decision-making. Existing work relies on historical incident corpora, operates directly on high-volume telemetry beyond current LLM capacity, or embeds reasoning inside complex multi-agent pipelines -- conditions that obscure whether failures arise from the reasoning itself or from peripheral design choices. We present a focused empirical evaluation that isolates an LLM's reasoning behavior: a controlled experimental framework that foregrounds the LLM within a deliberately simplified setting. We evaluate six LLMs under two agentic workflows (ReAct and Plan-and-Execute) and a non-agentic baseline on two real-world case studies (GAIA and OpenRCA). In total, we executed 48,000 simulated failure scenarios, totaling 228 days of execution time. We measure both root-cause accuracy and the quality of intermediate reasoning traces, and we produce a labeled taxonomy of 16 common RCA reasoning failures, annotated using an LLM-as-a-Judge. Our results clarify where current open-source LLMs succeed and fail in multi-hop RCA, quantify their sensitivity to input data modalities, and identify reasoning failures that predict final correctness. Together, these contributions provide transparent, reproducible empirical results and a failure taxonomy to guide future work on reasoning-driven system diagnosis.