Large language model (LLM) services have become an integral part of search, assistance, and decision-making applications. However, unlike traditional web or microservice stacks, the hardware and software stack enabling LLM inference deployment is more complex and far less field-tested, making it more susceptible to failures that are difficult to resolve. Keeping outage costs and quality-of-service degradation in check depends on shortening the mean time to repair, which in practice is gated by how quickly a fault is identified, located, and diagnosed. Automated root cause analysis (RCA) accelerates failure localization by identifying the system component that failed and tracing how the failure propagated. Numerous RCA methods have been developed for traditional services, drawing on request-path tracing, resource metrics, and log analysis. Yet existing RCA methods were not designed for LLM deployments, which present distinct runtime characteristics. In this study, we evaluate the effectiveness of RCA methods on a best-practice LLM inference deployment under controlled failure injections. Across 24 methods (20 metric-based, two trace-based, and two multi-source), we find that multi-source approaches achieve the highest accuracy, metric-based methods show fault-type-dependent performance, and trace-based methods largely fail. These results reveal that existing RCA tools do not generalize to LLM systems, motivating tailored analysis techniques and enhanced observability, for which we formulate guidelines.