Ensuring the reliability and user satisfaction of cloud services requires prompt anomaly detection followed by diagnosis. Existing anomaly detection techniques focus solely on real-time detection, meaning that alerts are issued only once anomalies have occurred. However, anomalies can propagate and escalate into failures, making faster-than-real-time anomaly detection highly desirable for expediting downstream analysis and intervention. This paper proposes Maat, the first work to address anomaly anticipation of performance metrics in cloud services. Maat adopts a novel two-stage paradigm for anomaly anticipation, consisting of metric forecasting and anomaly detection on forecasts. The metric forecasting stage employs a conditional denoising diffusion model to enable multi-step forecasting in an auto-regressive manner. The detection stage extracts anomaly-indicating features based on domain knowledge and applies isolation forest with incremental learning to detect upcoming anomalies. As a result, our method can uncover anomalies that better conform to human expertise. Evaluation on three publicly available datasets demonstrates that Maat can anticipate anomalies faster than real time, with effectiveness comparable to or better than that of state-of-the-art real-time anomaly detectors. We also present cases highlighting Maat's success in forecasting abnormal metrics and discovering anomalies.
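The two-stage idea described above (forecast the metric first, then run anomaly detection on the forecast) can be illustrated with a minimal sketch. This is not Maat's implementation: the forecaster below is a naive auto-regressive linear extrapolation standing in for the conditional denoising diffusion model, the two features (mean level and net change) are hypothetical placeholders for the domain-knowledge features, and the isolation forest is a small from-scratch version without incremental learning. All function names, parameters, and data are assumptions made for illustration.

```python
import math
import random


def forecast(history, horizon, order=4):
    """Stage 1 stand-in: auto-regressive multi-step forecasting.

    Each step extrapolates the average slope of the last `order` steps and
    feeds the prediction back in as input for the next step.
    """
    window = list(history)
    for _ in range(horizon):
        slope = (window[-1] - window[-1 - order]) / order
        window.append(window[-1] + slope)
    return window[len(history):]


def features(window):
    """Hypothetical anomaly-indicating features: mean level and net change."""
    return (sum(window) / len(window), window[-1] - window[0])


def avg_path(n):
    """Average path length of an unsuccessful binary-search-tree lookup."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def build_tree(points, rng, depth, max_depth):
    """Recursively isolate points with random axis-aligned splits."""
    if depth >= max_depth or len(points) <= 1:
        return ("leaf", len(points))
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return ("leaf", len(points))
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return ("node", dim, split,
            build_tree(left, rng, depth + 1, max_depth),
            build_tree(right, rng, depth + 1, max_depth))


def path_length(tree, x, depth=0):
    """Depth at which x lands in a leaf, adjusted for the leaf's size."""
    if tree[0] == "leaf":
        return depth + avg_path(tree[1])
    _, dim, split, left, right = tree
    return path_length(left if x[dim] < split else right, x, depth + 1)


def fit_forest(points, n_trees=100, sample_size=64, seed=1):
    """Stage 2 detector: a minimal isolation forest (no incremental learning)."""
    rng = random.Random(seed)
    size = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(max(size, 2)))
    trees = [build_tree(rng.sample(points, size), rng, 0, max_depth)
             for _ in range(n_trees)]
    return trees, size


def anomaly_score(forest, x):
    """Score in (0, 1); shorter average isolation paths give higher scores."""
    trees, size = forest
    mean_path = sum(path_length(t, x) for t in trees) / len(trees)
    return 2.0 ** (-mean_path / avg_path(size))


# Synthetic "normal" metric (an assumption): noise around a level of 50.
WINDOW = 8
data_rng = random.Random(0)
normal_metric = [50 + data_rng.uniform(-1, 1) for _ in range(400)]
train = [features(normal_metric[i:i + WINDOW])
         for i in range(len(normal_metric) - WINDOW + 1)]
forest = fit_forest(train)

# Anticipation: score the forecast window instead of the observed window.
calm_history = [50.0] * 14
ramp_history = [50.0] * 8 + [55.0, 60.0, 65.0, 70.0, 75.0, 80.0]
calm_forecast = forecast(calm_history, WINDOW)
ramp_forecast = forecast(ramp_history, WINDOW)
calm_score = anomaly_score(forest, features(calm_forecast))
ramp_score = anomaly_score(forest, features(ramp_forecast))
print("calm forecast score:", round(calm_score, 3))
print("ramp forecast score:", round(ramp_score, 3))
```

In this toy setting the detector never sees the future values themselves; it only scores features of the forecast window, so an alert on the ramping metric can be raised before the extrapolated values are actually observed. How early and how accurately that happens in practice depends entirely on the quality of the forecaster, which is the part this sketch simplifies most.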