Video-based AI systems are increasingly adopted in safety-critical domains such as autonomous driving and healthcare. However, interpreting their decisions remains challenging due to the inherent spatiotemporal complexity of video data and the opacity of deep learning models. Existing explanation techniques often suffer from limited temporal coherence and a lack of actionable causal insights. Current counterfactual explanation methods typically do not incorporate guidance from the target model, reducing semantic fidelity and practical utility. We introduce Latent Diffusion for Video Counterfactual Explanations (LD-ViCE), a novel framework designed to explain the behavior of video-based AI models. Compared to previous approaches, LD-ViCE reduces the computational cost of generating explanations by operating in the latent space of a state-of-the-art diffusion model, while producing realistic and interpretable counterfactuals through an additional refinement step. Experiments on three diverse video datasets (EchoNet-Dynamic for cardiac ultrasound, FERV39k for facial expression recognition, and Something-Something V2 for action recognition), using multiple target models covering both classification and regression tasks, demonstrate that LD-ViCE generalizes well and achieves state-of-the-art performance. On the EchoNet-Dynamic dataset, LD-ViCE achieves significantly higher regression accuracy than prior methods and exhibits high temporal consistency, while the refinement stage further improves perceptual quality. Qualitative analyses confirm that LD-ViCE produces semantically meaningful and temporally coherent explanations, providing actionable insights into model behavior. LD-ViCE advances the trustworthiness and interpretability of video-based AI systems through visually coherent counterfactual explanations.