As modern software systems grow in complexity and volume, anomaly detection on the multivariate monitoring metrics that profile a system's health status becomes increasingly critical and challenging. In particular, the dependencies between different metrics and their historical patterns play a critical role in prompt and accurate anomaly detection. Existing approaches fall short of industrial needs because they cannot capture this information efficiently. To fill this gap, we propose CMAnomaly, an anomaly detection framework for multivariate monitoring metrics based on a collaborative machine. The proposed collaborative machine is a mechanism that captures pairwise interactions along both the feature and temporal dimensions with linear time complexity. Cost-effective models can then leverage both the dependencies between monitoring metrics and their historical patterns for anomaly detection. We extensively evaluate the framework on both public data and industrial data collected from a large-scale online service system of Huawei Cloud. The experimental results show that CMAnomaly achieves an average F1 score of 0.9494, outperforming state-of-the-art baseline models by 6.77% to 10.68%, and runs 10X to 20X faster. We also share our experience of deploying CMAnomaly in Huawei Cloud.
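The claim that pairwise interactions can be captured in linear time can be illustrated with the well-known factorization-machine identity, which rewrites a sum over all pairs as a difference of two per-factor sums. The abstract does not state the collaborative machine's formulation, so the sketch below is only an assumption about the general idea, not CMAnomaly's actual method. The names `pairwise_naive`, `pairwise_linear`, and `make_example` are hypothetical, `x` stands for a flattened window of metric values, and `v` holds one latent vector of length `k` per entry of `x`.

```python
import random


def pairwise_naive(x, v):
    """Sum of <v[i], v[j]> * x[i] * x[j] over all pairs i < j.

    Cost is O(n^2 * k) for n inputs and k latent factors.
    """
    n = len(x)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(v[i], v[j]))
            total += dot * x[i] * x[j]
    return total


def pairwise_linear(x, v):
    """Same quantity in O(n * k), using the factorization-machine identity:

    sum_{i<j} <v_i, v_j> x_i x_j
        = 0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i (v_if x_i)^2 ]
    """
    k = len(v[0])  # assumes at least one input
    total = 0.0
    for f in range(k):
        s = 0.0   # running sum of v_if * x_i
        sq = 0.0  # running sum of (v_if * x_i)^2
        for xi, vi in zip(x, v):
            t = vi[f] * xi
            s += t
            sq += t * t
        total += s * s - sq
    return 0.5 * total


def make_example(n=6, k=3, seed=0):
    """Build random inputs and latent vectors (illustrative data only)."""
    rng = random.Random(seed)
    x = [rng.uniform(-1, 1) for _ in range(n)]
    v = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(n)]
    return x, v


if __name__ == "__main__":
    x, v = make_example()
    # The two values should agree up to floating-point error.
    print(pairwise_naive(x, v), pairwise_linear(x, v))
```

Both functions compute the same interaction term; only the second avoids enumerating pairs, which is what makes this kind of mechanism cheap enough for long windows over many metrics.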