Large numbers of key performance indicators (KPIs) are monitored as multivariate time series (MTS) to ensure the reliability of software applications and service systems. Accurately detecting anomalies in MTS is critical for subsequent fault elimination. Because anomalies and manual labels are both scarce, various self-supervised MTS anomaly detection (AD) methods have been developed; they optimize an overall objective (loss) that encompasses the regression objectives (losses) of all metrics. However, our empirical study reveals that conflicts among the metrics' regression objectives are prevalent, forcing MTS models to trade off competing losses. This critical issue significantly degrades detection performance but has been overlooked by existing approaches. To address it, we introduce CAD, a Conflict-aware multivariate KPI Anomaly Detection algorithm modeled on the multi-gate mixture-of-experts (MMoE) design. CAD provides an exclusive structure for each metric to mitigate potential conflicts while fostering mutual promotion among metrics. Upon thorough investigation, we find that the poor performance of vanilla MMoE stems mainly from the input-output misalignment in the MTS formulation and from convergence issues caused by the large number of tasks. To address these challenges, we propose a simple yet effective task-oriented metric selection and a p&s (personalized and shared) gating mechanism, which together establish CAD as the first practicable multi-task learning (MTL) based MTS AD model. Evaluations show that CAD obtains an average F1-score of 0.943 across three public datasets, notably outperforming state-of-the-art methods. Our code is accessible at https://github.com/dawnvince/MTS_CAD.
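As a rough illustration of what a conflict between two metrics' regression objectives can look like, the toy sketch below lets two metrics share one parameter vector and compares the gradients of their squared-error losses. It is not the CAD implementation and not necessarily how the paper measures conflict: the negative-cosine criterion, the metric names (`cpu_usage`, `latency`), and all numeric values are assumptions chosen for illustration.

```python
import math

def predict(weights, inputs):
    """Linear prediction shared by every metric in this toy setup."""
    return sum(w * x for w, x in zip(weights, inputs))

def grad_squared_error(weights, inputs, target):
    """Gradient of (weights . inputs - target)^2 with respect to weights."""
    residual = predict(weights, inputs) - target
    return [2.0 * residual * x for x in inputs]

def norm(vector):
    return math.sqrt(sum(v * v for v in vector))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (norm(u) * norm(v))

# Hypothetical shared parameters and one input window (illustrative values).
weights = [0.5, -0.2, 0.1]
inputs = [1.0, 2.0, 3.0]

# Hypothetical regression targets for two metrics that share the parameters.
# The shared prediction lies between the two targets, so each loss wants to
# move the parameters in the opposite direction.
targets = {"cpu_usage": 1.0, "latency": -0.5}

grads = {name: grad_squared_error(weights, inputs, target)
         for name, target in targets.items()}

# Assumed conflict criterion: negative cosine similarity between gradients.
similarity = cosine(grads["cpu_usage"], grads["latency"])
in_conflict = similarity < 0

# Gradient of the overall loss (the sum of both metric losses): the two
# contributions partly cancel, so the shared update serves neither metric well.
total_grad = [a + b for a, b in zip(grads["cpu_usage"], grads["latency"])]

print("cosine similarity:", similarity)
print("in conflict:", in_conflict)
print("gradient norms:", {name: norm(g) for name, g in grads.items()})
print("overall gradient norm:", norm(total_grad))
```

In this toy case the overall gradient is smaller in norm than either per-metric gradient, which is one way to picture why giving each metric its own structure, as the abstract describes, could reduce interference.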