The field of time series anomaly detection is advancing rapidly, and the many methods now available make it challenging to determine the most appropriate one for a specific domain. These methods are evaluated using metrics, which vary widely in their properties. Although new evaluation metrics have been proposed, there is limited agreement on which metrics are best suited to specific scenarios and domains, and the most commonly used metrics have faced criticism in the literature. This paper provides a comprehensive overview of the metrics used to evaluate time series anomaly detection methods and defines a taxonomy of these metrics based on how they are calculated. By defining a set of properties for evaluation metrics, together with a set of specific case studies and experiments, twenty metrics are analyzed and discussed in detail, highlighting the suitability of each for specific tasks. Through extensive experimentation and analysis, this paper argues that the choice of evaluation metric must be made with care, taking into account the specific requirements of the task at hand.