TimeSeriesBench: An Industrial-Grade Benchmark for Time Series Anomaly Detection Models

Driven by the growing number and scale of real-world applications, time series anomaly detection (TSAD) has attracted considerable academic and industrial interest. However, existing algorithms fall short of the needs of real-world industrial systems in three respects: training paradigm, online detection paradigm, and evaluation criteria. First, current algorithms typically train a separate model for each time series. In a large-scale online system with tens of thousands of curves, maintaining that many models is impractical, yet how well a single unified model detects anomalies across all curves remains unknown. Second, most TSAD models are trained on the historical part of a time series and tested on its future segment. In distributed systems, however, deployments and upgrades are frequent, and new, previously unseen time series emerge daily. How current TSAD algorithms perform on such newly arriving, unseen time series also remains unknown. Third, although some papers have conducted detailed surveys, the absence of an online evaluation platform leaves questions such as "Who is the best at anomaly detection at the current stage?" unanswered. In this paper, we propose TimeSeriesBench, an industrial-grade benchmark that we continuously maintain as a leaderboard. On this leaderboard, we assess existing algorithms across more than 168 evaluation settings that combine different training and testing paradigms, evaluation metrics, and datasets. Based on a comprehensive analysis of the results, we provide recommendations for the future design of anomaly detection algorithms. To address known issues with existing public datasets, we release an industrial dataset to the public together with TimeSeriesBench. All code, data, and the online leaderboard are publicly available.
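The three training and testing setups contrasted in the abstract can be made concrete with a small sketch. It is only an illustration under stated assumptions, not the benchmark's actual code: the function names (`per_series_split`, `unified_split`, `unseen_split`), the 50% split ratio, and the use of plain Python lists for curves are all hypothetical choices made here.

```python
from typing import Dict, List, Tuple

Series = List[float]


def per_series_split(
    curves: Dict[str, Series], train_ratio: float = 0.5
) -> Dict[str, Tuple[Series, Series]]:
    """One model per curve: train on each curve's history, test on its future."""
    splits = {}
    for name, values in curves.items():
        cut = int(len(values) * train_ratio)  # hypothetical split point
        splits[name] = (values[:cut], values[cut:])
    return splits


def unified_split(
    curves: Dict[str, Series], train_ratio: float = 0.5
) -> Tuple[List[Series], Dict[str, Series]]:
    """One unified model: pool the histories of all curves for training,
    then test on the future segment of every curve."""
    per_curve = per_series_split(curves, train_ratio)
    train_pool = [history for history, _ in per_curve.values()]
    test_sets = {name: future for name, (_, future) in per_curve.items()}
    return train_pool, test_sets


def unseen_split(
    curves: Dict[str, Series], train_names: List[str]
) -> Tuple[List[Series], Dict[str, Series]]:
    """Unseen curves: train on whole curves in train_names,
    test on curves the model never saw during training."""
    train_pool = [curves[name] for name in train_names]
    test_sets = {n: v for n, v in curves.items() if n not in train_names}
    return train_pool, test_sets
```

With `curves = {"cpu": [...], "mem": [...]}`, the first function yields one train/test pair per curve, the second yields a single pooled training set shared by all curves, and the third holds out entire curves so that the test data comes from series absent from training.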
