Assess and Summarize: Improve Outage Understanding with Large Language Models

Pengxiang Jin, Shenglin Zhang, Minghua Ma, Haozhe Li, Yu Kang, Liqun Li, Yudong Liu, Bo Qiao, Chaoyun Zhang, Pu Zhao, Shilin He, Federica Sarro, Yingnong Dang, Saravan Rajmohan, Qingwei Lin, Dongmei Zhang

Cloud systems have become increasingly popular in recent years due to their flexibility and scalability. When applications and services hosted on the cloud are affected by an outage, users can experience slow response times, connection issues, or total service disruption, resulting in a significant negative business impact. Outages usually comprise several concurrent events/root causes, so understanding the context of an outage is a challenging yet crucial first step toward mitigating and resolving it. In current practice, on-call engineers with in-depth domain knowledge have to manually assess and summarize outages when they happen, which is time-consuming and labor-intensive. In this paper, we first present a large-scale empirical study investigating how on-call engineers currently deal with cloud outages at Microsoft, and then present and empirically validate a novel approach (dubbed Oasis) to help engineers in this task. Oasis automatically assesses the impact scope of outages and produces human-readable summaries. Specifically, Oasis first assesses the impact scope of an outage by aggregating relevant incidents via multiple techniques. It then generates a human-readable summary by leveraging fine-tuned large language models such as GPT-3.x. The impact assessment component of Oasis was introduced at Microsoft over three years ago and is now widely adopted, while the outage summarization component has been introduced only recently. In this article, we present the results of an empirical evaluation carried out on 18 real-world cloud systems, as well as a human-based evaluation with outage owners. The results show that Oasis can effectively and efficiently summarize outages, and they have led Microsoft to deploy its first prototype, which is currently under experimental adoption by some of the incident teams.
