Despite the growing body of work on explainable machine learning for time series classification (TSC), it remains unclear how to evaluate different explainability methods. Qualitative assessment and user studies are hard to apply to explainers for TSC because humans have difficulty understanding the underlying information contained in time series data. A systematic review and quantitative comparison of explanation methods to confirm their correctness is therefore crucial. While steps toward standardized evaluation have been taken for tabular, image, and textual data, benchmarking explainability methods on time series remains challenging because a) traditional metrics are not directly applicable, b) implementations and adaptations of traditional metrics for time series vary across the literature, and c) baseline implementations vary. This paper proposes XTSC-Bench, a benchmarking tool providing standardized datasets, models, and metrics for evaluating explanation methods on TSC. We analyze 3 perturbation-based, 6 gradient-based, and 2 example-based explanation methods for TSC, showing that improvements in the explainers' robustness and reliability are necessary, especially for multivariate data.