Ensuring the reliability and availability of cloud services requires efficient root cause analysis (RCA) of cloud incidents. Traditional RCA relies on manual investigation of data sources such as logs and traces, which is laborious, error-prone, and challenging for on-call engineers. In this paper, we introduce RCACopilot, an on-call system powered by a large language model (LLM) that automates RCA for cloud incidents. RCACopilot matches incoming incidents to corresponding handlers based on their alert types, aggregates the critical runtime diagnostic information, predicts the incident's root cause category, and provides an explanatory narrative. We evaluate RCACopilot on a real-world dataset consisting of a year's worth of incidents from serviceX in companyX. Our evaluation shows that RCACopilot achieves an RCA accuracy of up to 0.766. Furthermore, the diagnostic information collection component of RCACopilot has been in successful use at companyX for over four years.
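The flow described in the abstract (match an incident to a handler by alert type, aggregate diagnostics, predict a root cause category with an explanation) can be illustrated with a minimal sketch. This is not the RCACopilot implementation: every name below (`Incident`, `Pipeline`, `disk_handler`, `keyword_predictor`, the `DiskAlert` alert type, and the `DiskFull` category) is hypothetical, and the LLM is replaced by a trivial keyword rule so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Incident:
    incident_id: str
    alert_type: str


# A handler gathers diagnostic text for one alert type (hypothetical interface).
Handler = Callable[[Incident], List[str]]
# A predictor maps aggregated diagnostics to (category, explanation).
# The abstract assigns this role to an LLM; here it is an injected stand-in.
Predictor = Callable[[str], Tuple[str, str]]


class Pipeline:
    def __init__(self, predictor: Predictor) -> None:
        self._handlers: Dict[str, Handler] = {}
        self._predictor = predictor

    def register(self, alert_type: str, handler: Handler) -> None:
        """Associate a handler with an alert type."""
        self._handlers[alert_type] = handler

    def analyze(self, incident: Incident) -> Tuple[str, str]:
        """Match the incident to its handler, aggregate diagnostics, predict."""
        handler = self._handlers.get(incident.alert_type)
        if handler is None:
            raise KeyError(f"no handler for alert type {incident.alert_type!r}")
        diagnostics = "\n".join(handler(incident))
        return self._predictor(diagnostics)


def disk_handler(incident: Incident) -> List[str]:
    # Assumed diagnostics; a real handler would query logs, traces, or metrics.
    return [f"incident={incident.incident_id}", "disk usage: 98%"]


def keyword_predictor(diagnostics: str) -> Tuple[str, str]:
    # Placeholder for the LLM call: a keyword rule, not a learned model.
    if "disk" in diagnostics:
        return ("DiskFull", "Diagnostics mention high disk usage.")
    return ("Unknown", "No known pattern matched.")


pipeline = Pipeline(keyword_predictor)
pipeline.register("DiskAlert", disk_handler)
category, explanation = pipeline.analyze(Incident("i-1", "DiskAlert"))
```

Passing the predictor in as a callable keeps the handler routing independent of the model, so the keyword rule could be swapped for an actual LLM client without changing `Pipeline`.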