Explainable artificial intelligence (XAI) techniques are becoming increasingly important as deep learning is applied in more and more domains. These techniques aim to make complex "black box" models easier to understand and to increase user trust while maintaining high predictive performance. Many studies have focused on explaining deep learning models for image input in computer vision, but video explanations remain relatively unexplored because of the added complexity of the temporal dimension. In this paper, we present a unified framework for local, model-agnostic explanations in the video domain. Our contributions are: (1) extending a fine-grained explanation framework tailored to computer vision data, (2) adapting six existing explanation techniques to video data by incorporating temporal information and enabling local explanations, and (3) evaluating and comparing the adapted explanation methods across different models and datasets. We discuss the options and design choices involved in the removal-based explanation process for visual data. We then describe how each of the six explanation methods is adapted to video and compare the adaptations with existing approaches. We evaluate the methods using both automated metrics and a user-based evaluation, and find that 3D RISE, 3D LIME, and 3D Kernel SHAP outperform the other methods. By decomposing the explanation process into manageable steps, we make it easier to study the impact of each choice and to refine explanation methods further for specific datasets and models.
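To make the removal-based process concrete, the following is a minimal sketch of a RISE-style explanation extended to the temporal dimension. It is an illustration under stated assumptions, not the implementation evaluated in this paper: the function name `rise_3d_saliency`, the parameters `cells` and `keep_prob`, and the black-box `score_fn` are hypothetical, removed regions are simply replaced with zeros, and the coarse masks are upsampled by nearest-neighbour repetition rather than with the smooth interpolation and random shifts commonly used in RISE.

```python
import numpy as np


def rise_3d_saliency(video, score_fn, n_masks=500, cells=(2, 4, 4),
                     keep_prob=0.5, seed=0):
    """Estimate a (T, H, W) saliency map for one video by random masking.

    video:    float array of shape (T, H, W, C).
    score_fn: hypothetical black-box callable mapping a masked video of the
              same shape to a scalar score for the class being explained.
    cells:    size of the coarse mask grid along (time, height, width).
    """
    rng = np.random.default_rng(seed)
    t, h, w = video.shape[:3]
    # Repetition factor per axis (ceiling division) so the upsampled mask
    # covers the whole video before cropping.
    reps = [-(-dim // c) for dim, c in zip((t, h, w), cells)]

    saliency = np.zeros((t, h, w), dtype=np.float64)
    for _ in range(n_masks):
        # Coarse binary mask: 1 keeps a spatio-temporal cell, 0 removes it.
        mask = (rng.random(cells) < keep_prob).astype(np.float64)
        # Nearest-neighbour upsampling along time, height and width.
        for axis, r in enumerate(reps):
            mask = np.repeat(mask, r, axis=axis)
        mask = mask[:t, :h, :w]

        # Removal step (assumption: removed voxels are replaced by zeros).
        masked_video = video * mask[..., None]

        # Voxels that are kept when the score is high accumulate more weight.
        saliency += score_fn(masked_video) * mask

    # Normalise by the expected number of times each voxel is kept.
    return saliency / (n_masks * keep_prob)
```

In this sketch, the choices that the framework separates out (how regions are defined, how they are removed, how many perturbed samples are drawn, and how scores are aggregated) appear as the `cells` grid, the zero replacement, `n_masks`, and the weighted sum, so each can be varied independently.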