Modern cloud services are prone to failures because of their complex architecture, which makes diagnosis a critical process. Site Reliability Engineers (SREs) spend hours combining multiple sources of data, including alerts, error logs, and domain expertise gained from past incidents, to locate the root cause(s). This expertise is documented as natural language text in reports on previous outages. However, systematically using the raw yet rich semi-structured information in these reports is time-consuming. Structured information such as alerts, which is often used during fault diagnosis, is voluminous and requires expert knowledge to interpret. Several strategies have been proposed that use each source of data separately for root cause analysis. In this work, we build a diagnostic service called ESRO that recommends root causes and remediation for failures by systematically utilizing both structured and semi-structured sources of data. During training, ESRO constructs a causal graph from alerts and a knowledge graph from outage reports, and merges them in a novel way to form a unified graph. At inference time, a retrieval-based mechanism searches the unified graph and ranks the likely root causes and remediation techniques based on the alerts fired during an outage. The recommendation takes into account not only the individual alerts but also their respective importance in predicting an outage group. We evaluated our model on several cloud service outages of a large SaaS enterprise over the course of roughly two years, comparing the recommended root causes against the ground truth, and obtained an average improvement of 27% in ROUGE scores over state-of-the-art baselines. We further establish the effectiveness of ESRO through qualitative analysis of multiple real outage examples.
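The retrieval step described above can be illustrated with a toy sketch. This is not ESRO's implementation: the unified graph is replaced here by a flat list of past outages, and the ranking uses a plain weighted Jaccard overlap between alert sets, which is a stand-in technique. The names `OutageRecord`, `weighted_overlap`, and `recommend`, the alert names, and the importance weights are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutageRecord:
    """One past outage: the alerts that fired plus text taken from its report."""
    alerts: frozenset
    root_cause: str
    remediation: str


def weighted_overlap(query, past, weight):
    """Weighted Jaccard similarity between two alert sets.

    Alerts missing from `weight` default to an importance of 1.0.
    """
    union = query | past
    total = sum(weight.get(a, 1.0) for a in union)
    if total == 0:
        return 0.0
    shared = sum(weight.get(a, 1.0) for a in query & past)
    return shared / total


def recommend(query_alerts, history, weight, top_k=3):
    """Rank past outages by similarity to the alerts fired in the current outage."""
    query = frozenset(query_alerts)
    scored = [(weighted_overlap(query, rec.alerts, weight), rec) for rec in history]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical history and alert-importance weights (assumptions, not real data).
history = [
    OutageRecord(
        frozenset({"db_latency_high", "conn_pool_exhausted"}),
        "Connection pool exhausted after a config change",
        "Roll back the config and restart the pool",
    ),
    OutageRecord(
        frozenset({"disk_full", "db_latency_high"}),
        "Log volume filled the data disk",
        "Rotate logs and expand the volume",
    ),
]
weight = {"db_latency_high": 0.2, "conn_pool_exhausted": 1.0, "disk_full": 1.0}

ranked = recommend(["db_latency_high", "conn_pool_exhausted"], history, weight)
for score, rec in ranked:
    print(f"{score:.2f}  {rec.root_cause}  ->  {rec.remediation}")
```

In this sketch the low weight on `db_latency_high` keeps a generic alert that appears in both records from dominating the ranking, which loosely mirrors the idea of accounting for each alert's importance rather than treating all alerts equally.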