We developed REVEX, a framework for removal-based video explanations. This work extends fine-grained explanation frameworks for computer vision data and adapts six existing techniques to video by adding temporal information and local explanations. We evaluated the adapted methods across networks, datasets, image classes, and evaluation metrics. Decomposing the explanation process into steps revealed strengths and weaknesses of the studied methods, for example in pixel clustering and input perturbation. Video LIME outperformed the other methods, with deletion values up to 31\% lower and insertion values up to 30\% higher, depending on method and network. Video RISE performed best on the average drop metric, with values 10\% lower. In contrast, localization-based metrics revealed low performance across all methods, with significant variation depending on the network: pointing game accuracy reached 53\%, and IoU-based metrics remained below 20\%. Drawing on the findings across XAI methods, we further examine the limitations of the employed XAI evaluation metrics and highlight their suitability for different applications.
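To make the localization-based metrics mentioned above concrete, the following is a minimal sketch of how a pointing game hit and an IoU score might be computed for a saliency map against a ground-truth bounding box. It is not the REVEX implementation: the function names, the fixed threshold of 0.5, the inclusive box convention, and the per-frame averaging over a clip are all assumptions made for illustration.

```python
# Illustrative sketch only: toy definitions of two localization metrics.
# Function names, the 0.5 threshold, and per-frame averaging are assumptions,
# not taken from the REVEX code.

def pointing_game_hit(saliency, box):
    """Return True if the most salient pixel lies inside the ground-truth box.

    saliency: 2D list of floats (rows x cols) for one frame.
    box: (row_min, col_min, row_max, col_max), inclusive bounds.
    """
    best_r, best_c = max(
        ((r, c) for r in range(len(saliency)) for c in range(len(saliency[0]))),
        key=lambda rc: saliency[rc[0]][rc[1]],
    )
    r0, c0, r1, c1 = box
    return r0 <= best_r <= r1 and c0 <= best_c <= c1


def saliency_iou(saliency, box, threshold=0.5):
    """IoU between the thresholded saliency mask and the box region."""
    r0, c0, r1, c1 = box
    inter = union = 0
    for r, row in enumerate(saliency):
        for c, value in enumerate(row):
            in_mask = value >= threshold
            in_box = r0 <= r <= r1 and c0 <= c <= c1
            inter += in_mask and in_box   # pixel is in both regions
            union += in_mask or in_box    # pixel is in at least one region
    return inter / union if union else 0.0


def evaluate_video(saliency_frames, boxes, threshold=0.5):
    """Average per-frame pointing game accuracy and IoU over one clip."""
    n = len(saliency_frames)
    hits = sum(pointing_game_hit(s, b) for s, b in zip(saliency_frames, boxes))
    ious = sum(saliency_iou(s, b, threshold) for s, b in zip(saliency_frames, boxes))
    return hits / n, ious / n
```

A low IoU can coexist with a pointing game hit: a saliency map whose peak falls inside the box but whose thresholded mask covers only a small part of it (or spills well outside it) scores a hit yet a poor overlap, which is one reason the two metrics can diverge.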