Anomaly detection in multivariate time series is essential across domains such as healthcare, cybersecurity, and industrial monitoring, yet remains fundamentally challenging due to high-dimensional dependencies, the presence of cross-correlations between time-dependent variables, and the scarcity of labeled anomalies. We introduce mTSBench, the largest benchmark to date for multivariate time series anomaly detection and model selection, consisting of 344 labeled time series across 19 datasets from a wide range of application domains. We comprehensively evaluate 24 anomaly detectors, including the only two publicly available large language model-based methods for multivariate time series. Consistent with prior findings, we observe that no single detector dominates across datasets, motivating the need for effective model selection. We benchmark three recent model selection methods and find that even the strongest of them remains far from optimal. Our results highlight the outstanding need for robust, generalizable selection strategies. We open-source the benchmark at https://plan-lab.github.io/mtsbench to encourage future research.