Auditable Graph-Guided Root Cause Analysis for Kubernetes Incidents

A root-cause analysis (RCA) system diagnoses Kubernetes incidents reliably only when its reported gains come from incident evidence rather than scenario-specific shortcuts. We present Graph Traversal Agent, a graph-guided RCA agent that combines LLM reasoning with specialized tools. The model reasons over a typed evidence graph, while deterministic graph and tool operations collect evidence, bound the search, and check proposed verdicts. We map four operational constraints (read-only evidence collection, propagation-aware diagnosis, bounded execution, and independently validated verdicts) to a typed incident graph, a LangGraph traversal state machine, and a separate validation stage. On ITBench snapshots scored by one fixed qwen-plus judge, the audited system raises root-cause-entity F1 over an earlier iteration of the same system from 0.6087 to 0.9130 on a 23-scenario common subset. A prompt-level ablation separates prompt-tuned gains from gains that survive once scenario-specific hints are removed: the stripped-prompt configuration retains 0.6958 F1 on a 19-scenario subset. The surviving gain concentrates on ChaosMesh scenarios whose ground-truth root cause is the injected fault object already present in the evidence graph, so we report it as benchmark-coupled rather than as evidence of broad cross-cluster RCA capability. Lightweight checks, including same-judge comparison, prompt-level ablation, cascade-source checking, and a telemetry no-leak test, mark each claim as supported, pending, or out of scope. We scope the work to ITBench OpenTelemetry-demo snapshots. Live-cluster trials served as an engineering stress test, but alert state and trace availability did not stay stable enough for controlled scoring, so we make no production-readiness or mean-time-to-repair claim.
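
To make the division of labor concrete, the following minimal sketch illustrates a bounded, read-only traversal over a typed incident graph followed by a separate validation step. It is an illustration under stated assumptions, not the system's implementation: it uses plain Python rather than LangGraph, and the names Node, IncidentGraph, traverse, and validate, the node kinds, the example services, and the specific cascade-source rule are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    kind: str  # illustrative node types, e.g. "Service" or "Pod"
    name: str


@dataclass
class IncidentGraph:
    # edges[src] lists the nodes that src depends on.
    edges: dict = field(default_factory=dict)
    # Nodes for which abnormal evidence was observed.
    anomalous: set = field(default_factory=set)

    def add_dependency(self, src: Node, dst: Node) -> None:
        self.edges.setdefault(src, []).append(dst)
        self.edges.setdefault(dst, [])


def traverse(graph: IncidentGraph, start: Node, max_steps: int) -> list:
    """Breadth-first walk from the alerting node, capped at max_steps visits."""
    visited, seen, queue = [], {start}, deque([start])
    while queue and len(visited) < max_steps:
        node = queue.popleft()
        visited.append(node)
        for dep in graph.edges.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return visited


def validate(graph: IncidentGraph, visited: list, candidate: Node) -> bool:
    """Accept a verdict only if the candidate was visited, is anomalous,
    and has no anomalous dependency (assumed cascade-source rule)."""
    if candidate not in visited or candidate not in graph.anomalous:
        return False
    return not any(dep in graph.anomalous for dep in graph.edges.get(candidate, []))


# Hypothetical three-service chain: frontend -> checkout -> payment.
frontend = Node("Service", "frontend")
checkout = Node("Service", "checkout")
payment = Node("Service", "payment")

graph = IncidentGraph()
graph.add_dependency(frontend, checkout)
graph.add_dependency(checkout, payment)
graph.anomalous = {frontend, checkout, payment}

visited = traverse(graph, frontend, max_steps=10)
print(validate(graph, visited, payment))   # deepest anomalous node
print(validate(graph, visited, checkout))  # symptom of a downstream fault
```

In this sketch the validation step never trusts the proposer: a candidate that was not reached within the step budget, or that merely propagates a fault from one of its dependencies, is rejected regardless of how it was proposed.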
