Alerts are critical for detecting anomalies in large-scale cloud systems and thus for ensuring reliability and user experience. However, current systems generate overwhelming volumes of alerts, and ineffective alert life-cycle management degrades operational efficiency. This paper details the efforts of Company-X to optimize alert life-cycle management and address alert fatigue in cloud systems. We propose AlertGuardian, a framework that combines large language models (LLMs) with lightweight graph models to optimize the alert life-cycle in three phases: Alert Denoise uses a graph learning model with virtual noise to filter out noisy alerts, Alert Summary employs Retrieval-Augmented Generation (RAG) with LLMs to create actionable summaries, and Alert Rule Refinement leverages iterative multi-agent feedback to improve alert rule quality. Evaluated on four real-world datasets from Company-X's services, AlertGuardian significantly mitigates alert fatigue (94.8% alert reduction ratio) and accelerates fault diagnosis (90.5% diagnosis accuracy). Moreover, AlertGuardian improves 1,174 alert rules, of which 375 were accepted by SREs (a 32% acceptance rate). Finally, we share success stories and lessons learned about alert life-cycle management from the deployment of AlertGuardian at Company-X.