Incident management for cloud services is a complex, multi-step process that has a significant impact on both service health and developer productivity. On-call engineers need a significant amount of domain knowledge and manual effort to root cause and mitigate production incidents. Recent advances in artificial intelligence have resulted in state-of-the-art large language models like GPT-3.x (both GPT-3.0 and GPT-3.5), which have been used to solve a variety of problems ranging from question answering to text summarization. In this work, we conduct the first large-scale study to evaluate the effectiveness of these models in helping engineers root cause and mitigate production incidents. We perform a rigorous study at Microsoft on more than 40,000 incidents and compare several large language models in zero-shot, fine-tuned, and multi-task settings using semantic and lexical metrics. Lastly, our human evaluation with actual incident owners shows the efficacy and future potential of using artificial intelligence for resolving cloud incidents.