Deep learning models have become the dominant approach for multivariate time series anomaly detection (MTSAD), often reporting substantial performance improvements over classical statistical methods. However, these gains are frequently measured under heterogeneous thresholding strategies and evaluation protocols, which makes fair comparison difficult. This work revisits OmniAnomaly, a widely used stochastic recurrent model for MTSAD, and systematically compares it with a simple linear baseline based on Principal Component Analysis (PCA) on the Server Machine Dataset (SMD). Both methods are evaluated under identical thresholding and evaluation procedures, with experiments repeated across 100 runs for each of the 28 machines in the dataset. Performance is measured with point-level precision, recall, and F1-score, with and without point-adjustment, and under different strategies for aggregating across machines and runs; the corresponding standard deviations are also reported. The results show large variability across machines and indicate that PCA can achieve performance comparable to OmniAnomaly, and can even outperform it when point-adjustment is not applied. These findings question the added value of more complex architectures under current benchmarking practices and highlight the critical role of evaluation methodology in MTSAD research.
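To make the distinction between scoring with and without point-adjustment concrete, the following is a minimal sketch of the point-adjustment convention as it is commonly described in the MTSAD literature: if any point inside a contiguous ground-truth anomaly segment is flagged, the whole segment is counted as detected. The function names `point_adjust` and `precision_recall_f1` and the toy arrays are illustrative assumptions, not the code used in this work, and thresholding of the anomaly scores is assumed to have already produced binary predictions.

```python
import numpy as np


def point_adjust(pred, label):
    """Return a copy of pred in which every ground-truth anomaly segment
    containing at least one flagged point is marked as fully detected."""
    pred = np.asarray(pred, dtype=bool).copy()
    label = np.asarray(label, dtype=bool)
    # Pad with False so segments touching either end are still delimited.
    padded = np.concatenate(([False], label, [False])).astype(int)
    diff = np.diff(padded)
    starts = np.flatnonzero(diff == 1)   # first index of each segment
    ends = np.flatnonzero(diff == -1)    # one past the last index
    for s, e in zip(starts, ends):
        if pred[s:e].any():
            pred[s:e] = True
    return pred


def precision_recall_f1(pred, label):
    """Point-level precision, recall and F1 for binary predictions."""
    pred = np.asarray(pred, dtype=bool)
    label = np.asarray(label, dtype=bool)
    tp = int(np.sum(pred & label))
    fp = int(np.sum(pred & ~label))
    fn = int(np.sum(~pred & label))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy example (assumed data): two anomaly segments, one partially hit.
    label = [0, 1, 1, 1, 0, 0, 1, 1, 0, 0]
    pred = [0, 0, 1, 0, 0, 1, 0, 0, 0, 0]
    print("raw     :", precision_recall_f1(pred, label))
    print("adjusted:", precision_recall_f1(point_adjust(pred, label), label))
```

In this toy case a single hit inside the first segment raises recall from 0.2 to 0.6 after adjustment, which illustrates why the two protocols can rank methods differently.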