LATS-RCA: Language Agent Tree Search for Root Cause Analysis in Microservices

Recent advances in large language models (LLMs) have enabled early attempts to automate root cause analysis (RCA) in microservice systems (MSS). However, existing approaches typically rely on linear reasoning that follows a single diagnostic path. In this paper, we propose Language Agent Tree Search for RCA (LATS-RCA) in MSS. LATS-RCA formulates RCA as a reflection-guided, tree-structured search over root-cause hypotheses: multiple agents iteratively analyze logs and metrics to collect evidence, and reflection scores guide the search toward the most likely root causes of abnormal services. We evaluate LATS-RCA on the open LO2 benchmark, where it achieves 91.3% diagnostic accuracy, and we analyze the associated computational cost. Variation among frontier-tier LLMs (Claude Sonnet 4.5, GPT-5, and Gemini 3 Pro) is small, with accuracy between 89.7% and 91.3%, indicating that our approach is model-agnostic. We also conduct an exploratory study that evaluates LATS-RCA on real-world incidents from the production MSS of a web-hosting company (Zoner Oy); this system serves over 300,000 websites across Europe. LATS-RCA correctly diagnoses 65.1% of the production incidents on average over multiple runs. This result reveals key challenges of real-world RCA that are absent from open benchmarks, including multi-factor root causes, large-scale system complexity, and incomplete observability. Future work should develop more realistic open datasets for RCA and validate LATS-RCA on additional datasets. Our replication package is available at https://github.com/kottinov/lats-rca.
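
To make the idea of a score-guided search over root-cause hypotheses concrete, the following is a minimal sketch and not the authors' implementation. It assumes a simple best-first selection rule (the abstract does not state which selection strategy LATS-RCA uses), and it replaces the LLM agents with two hypothetical callables: `expand`, which proposes child hypotheses, and `reflect`, which returns a reflection score in [0, 1]. The toy hypothesis names and scores are invented purely for illustration.

```python
import heapq
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Hypothesis:
    """One node in the search tree: a candidate root cause (hypothetical type)."""
    name: str
    parent: Optional["Hypothesis"] = None
    score: float = 0.0

    def path(self) -> List[str]:
        """Return the chain of hypotheses from the root down to this node."""
        node, names = self, []
        while node is not None:
            names.append(node.name)
            node = node.parent
        return list(reversed(names))


def tree_search(
    root_name: str,
    expand: Callable[[str], List[str]],
    reflect: Callable[[str], float],
    budget: int = 10,
) -> Hypothesis:
    """Best-first search: always expand the highest-scoring open hypothesis.

    `budget` caps the number of node expansions, standing in for a limit on
    LLM calls. Best-first ordering is an assumption of this sketch.
    """
    root = Hypothesis(root_name, score=reflect(root_name))
    best = root
    counter = 0  # unique tie-breaker so the heap never compares nodes
    frontier = [(-root.score, counter, root)]  # max-heap via negated score
    expanded = 0
    while frontier and expanded < budget:
        _, _, node = heapq.heappop(frontier)
        expanded += 1
        for child_name in expand(node.name):
            child = Hypothesis(child_name, parent=node, score=reflect(child_name))
            if child.score > best.score:
                best = child
            counter += 1
            heapq.heappush(frontier, (-child.score, counter, child))
    return best


# Toy stand-ins for the agents (invented data, for illustration only).
TOY_TREE = {
    "checkout is slow": ["database problem", "network problem"],
    "database problem": ["connection pool exhausted", "missing index"],
    "network problem": ["DNS timeout"],
}
TOY_SCORES = {
    "checkout is slow": 0.1,
    "database problem": 0.6,
    "network problem": 0.3,
    "connection pool exhausted": 0.9,
    "missing index": 0.4,
    "DNS timeout": 0.2,
}


def toy_expand(name: str) -> List[str]:
    return TOY_TREE.get(name, [])


def toy_reflect(name: str) -> float:
    return TOY_SCORES[name]


result = tree_search("checkout is slow", toy_expand, toy_reflect, budget=3)
print(result.path())
# ['checkout is slow', 'database problem', 'connection pool exhausted']
```

In contrast to a linear reasoning chain, the frontier keeps lower-scoring branches such as "network problem" available, so the search can return to them if the currently preferred branch stops yielding better hypotheses.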
