Ensuring the reliability and availability of cloud services requires efficient root cause analysis (RCA) for cloud incidents. Traditional RCA methods, which rely on manual investigation of data sources such as logs and traces, are often laborious, error-prone, and challenging for on-call engineers. In this paper, we introduce RCACopilot, an on-call system powered by a large language model that automates RCA for cloud incidents. RCACopilot matches each incoming incident to the corresponding incident handler based on its alert type, aggregates the critical runtime diagnostic information, predicts the incident's root cause category, and provides an explanatory narrative. We evaluate RCACopilot on a real-world dataset consisting of a year's worth of incidents from Microsoft. Our evaluation shows that RCACopilot achieves an RCA accuracy of up to 0.766. Furthermore, the diagnostic information collection component of RCACopilot has been in use at Microsoft for over four years.
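The four stages named in the abstract (handler matching by alert type, diagnostic aggregation, root cause category prediction, and explanation) can be illustrated with a minimal sketch. This is not RCACopilot's implementation: every name here (`Incident`, `HANDLERS`, `disk_handler`, `keyword_predictor`, `analyze`, and the alert type and category strings) is hypothetical, and a trivial keyword rule stands in for the large language model so that the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Incident:
    incident_id: str
    alert_type: str
    diagnostics: List[str] = field(default_factory=list)


# A handler collects diagnostic text for one alert type (hypothetical signature).
Handler = Callable[[Incident], List[str]]
# A predictor maps diagnostic text to (category, explanation).
# In a real system an LLM call would sit behind this interface.
Predictor = Callable[[str], Tuple[str, str]]


def disk_handler(incident: Incident) -> List[str]:
    # Placeholder for querying logs and metrics; returns canned text.
    return ["disk usage 98% on volume /data", "write failures in service log"]


# Registry that maps an alert type to its handler (assumed structure).
HANDLERS: Dict[str, Handler] = {"DiskAlert": disk_handler}


def keyword_predictor(summary: str) -> Tuple[str, str]:
    # Stand-in for the LLM: a keyword rule, for illustration only.
    if "disk usage" in summary:
        return "DiskFull", "Diagnostics report a near-full disk and failed writes."
    return "Unknown", "No known pattern matched the collected diagnostics."


def analyze(
    incident: Incident, predictor: Predictor = keyword_predictor
) -> Tuple[str, str]:
    # Stage 1: match the incident to a handler by alert type.
    handler = HANDLERS.get(incident.alert_type)
    if handler is None:
        return "Unknown", f"No handler registered for alert type {incident.alert_type!r}."
    # Stage 2: aggregate runtime diagnostic information.
    incident.diagnostics = handler(incident)
    summary = "\n".join(incident.diagnostics)
    # Stages 3 and 4: predict the category and return an explanation.
    return predictor(summary)
```

Passing the predictor as a parameter keeps the collection stage independent of the prediction stage, which mirrors the abstract's note that the diagnostic collection component has been deployed on its own.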