Ensuring the reliability and availability of cloud services requires efficient root cause analysis (RCA) of cloud incidents. Traditional RCA relies on manual investigation of data sources such as logs and traces, which is laborious, error-prone, and challenging for on-call engineers. In this paper, we introduce RCACopilot, an on-call system that uses a large language model (LLM) to automate RCA of cloud incidents. RCACopilot matches each incoming incident to the corresponding handler based on its alert type, aggregates the critical runtime diagnostic information, predicts the incident's root cause category, and provides an explanatory narrative. We evaluate RCACopilot on a real-world dataset comprising a year's worth of incidents from serviceX in companyX. Our evaluation shows that RCACopilot achieves RCA accuracy of up to 0.766. Furthermore, the diagnostic information collection component of RCACopilot has been in successful use at companyX for over four years.
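The four stages named above (handler matching by alert type, diagnostic aggregation, category prediction, explanation) can be illustrated with a minimal sketch. Everything in it is a hypothetical stand-in and not RCACopilot's actual implementation: the `Incident` type, the `HANDLERS` registry, the `DiskFull` alert type, and the category names are invented for illustration, and `predict_category` is a simple keyword rule used in place of the LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    incident_id: str
    alert_type: str
    diagnostics: List[str] = field(default_factory=list)


# A handler collects diagnostic text for one alert type.
Handler = Callable[[Incident], List[str]]


def disk_handler(incident: Incident) -> List[str]:
    # Placeholder for real log and trace queries.
    return ["disk usage 98% on volume /data", "write failures in service log"]


def default_handler(incident: Incident) -> List[str]:
    # Fallback when no handler is registered for the alert type.
    return ["no handler registered for this alert type"]


# Hypothetical registry mapping alert types to handlers.
HANDLERS: Dict[str, Handler] = {"DiskFull": disk_handler}


def predict_category(diagnostics: List[str]) -> Tuple[str, str]:
    """Keyword rule standing in for the LLM prediction (an assumption)."""
    text = " ".join(diagnostics).lower()
    if "disk" in text:
        return "StorageExhaustion", "Diagnostics mention disk pressure."
    return "Unknown", "No known pattern matched."


def run_rca(incident: Incident) -> Dict[str, str]:
    # 1. Match the incident to a handler by alert type.
    handler = HANDLERS.get(incident.alert_type, default_handler)
    # 2. Aggregate runtime diagnostic information.
    incident.diagnostics = handler(incident)
    # 3-4. Predict the root cause category and attach an explanation.
    category, explanation = predict_category(incident.diagnostics)
    return {
        "incident_id": incident.incident_id,
        "category": category,
        "explanation": explanation,
    }
```

Calling `run_rca(Incident("INC-1", "DiskFull"))` routes the incident through `disk_handler` and then through the stand-in predictor. In a real deployment the predictor would be replaced by a model call that receives the aggregated diagnostics.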