PACE-LM: Prompting and Augmentation for Calibrated Confidence Estimation with GPT-4 in Cloud Incident Root Cause Analysis

Major cloud providers have employed advanced AI-based solutions such as large language models (LLMs) to help humans identify the root causes of cloud incidents. Although AI-driven assistants are increasingly common in the root cause analysis process, their usefulness to on-call engineers is limited by low accuracy, which stems from the intrinsic difficulty of the task, the propensity of LLM-based approaches to hallucinate, and the difficulty of distinguishing these well-disguised hallucinations from correct answers. To address this challenge, we propose performing confidence estimation on the predictions to help on-call engineers decide whether to adopt a model prediction. Because many LLM-based root cause predictors are black boxes, approaches based on fine-tuning or temperature scaling are inapplicable. We therefore design an innovative confidence estimation framework that prompts retrieval-augmented LLMs and demands only a minimal amount of information from the root cause predictor. The approach consists of two scoring phases: the LLM-based confidence estimator first evaluates its own confidence in making a judgment about the current incident, which reflects its level of "groundedness" in the reference data, and then rates the root cause prediction against historical references. An optimization step combines these two scores into a final confidence assignment. We show that our method produces calibrated confidence estimates for predicted root causes, and we validate the usefulness of the retrieved historical data, the effectiveness of the prompting strategy, and the generalizability of the method across different root cause prediction models. Our study takes an important step toward reliably and effectively embedding LLMs into cloud incident management systems.
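
To make the final combination step concrete, the following is a minimal Python sketch, not the method described in the abstract. It assumes that both phase scores lie in [0, 1], that the combination is a single-weight linear blend, and that the weight is chosen by grid search to minimize expected calibration error (ECE) on labeled validation incidents. The abstract specifies none of these choices, and every name here (`grounding`, `rating`, `fit_weight`) is hypothetical.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    conf: Sequence[float], correct: Sequence[int], n_bins: int = 5
) -> float:
    """ECE: size-weighted mean of |accuracy - mean confidence| over equal-width bins."""
    n = len(conf)
    bins: List[List[int]] = [[] for _ in range(n_bins)]
    for i, c in enumerate(conf):
        # Confidence 1.0 falls into the last bin.
        bins[min(int(c * n_bins), n_bins - 1)].append(i)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf[i] for i in members) / len(members)
        accuracy = sum(correct[i] for i in members) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


def combine(grounding: float, rating: float, w: float) -> float:
    """Assumed linear blend of the two phase scores (illustrative only)."""
    return w * grounding + (1.0 - w) * rating


def fit_weight(
    grounding: Sequence[float],
    rating: Sequence[float],
    correct: Sequence[int],
    steps: int = 20,
) -> Tuple[float, float]:
    """Grid-search the blend weight that minimizes ECE on validation data."""
    best_w, best_ece = 0.0, float("inf")
    for k in range(steps + 1):
        w = k / steps
        blended = [combine(g, r, w) for g, r in zip(grounding, rating)]
        ece = expected_calibration_error(blended, correct)
        if ece < best_ece:
            best_w, best_ece = w, ece
    return best_w, best_ece


if __name__ == "__main__":
    # Toy validation set: four incidents with made-up scores and correctness labels.
    grounding = [0.9, 0.8, 0.4, 0.3]
    rating = [0.7, 0.9, 0.2, 0.6]
    correct = [1, 1, 0, 0]
    print(fit_weight(grounding, rating, correct))
```

A grid search is used only because it requires no gradient access, which suits a black-box setting. Any other optimizer or combination function could be substituted under the same assumptions.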
