Concept Bottleneck Models (CBMs) enable interpretable image classification by structuring predictions around human-understandable concepts, but extending this paradigm to video remains challenging due to the difficulty of extracting concepts and modeling them over time. In this paper, we introduce $\textbf{MoTIF}$ (Moving Temporal Interpretable Framework), a transformer-based concept architecture that operates on sequences of temporally grounded concept activations, employing per-concept temporal self-attention to model when individual concepts recur and how their temporal patterns contribute to predictions. Central to the framework is an agentic concept discovery module that automatically extracts object- and action-centric textual concepts from videos, yielding temporally expressive concept sets without manual supervision. Across multiple video benchmarks, this combination substantially narrows the performance gap between interpretable and black-box video models while maintaining faithful and temporally grounded concept explanations. Code available at $\href{https://github.com/patrick-knab/MoTIF}{github.com/patrick-knab/MoTIF}$.
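To make the per-concept temporal self-attention idea concrete, here is a minimal NumPy sketch: each concept's scalar activation time series is embedded and attended over the time axis independently of all other concepts, so the attention map for a concept shows when that concept's occurrences interact across frames. The embedding dimension, random projection matrices, and function names are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def per_concept_temporal_attention(acts, d=8, seed=0):
    """Sketch of per-concept temporal self-attention.

    acts: (T, C) array of concept activations over T frames and C concepts.
    Returns (C, T, d) attended features, one temporal sequence per concept.
    Projections are random placeholders standing in for learned weights.
    """
    rng = np.random.default_rng(seed)
    T, C = acts.shape
    # hypothetical learned projections, shared across concepts
    W_e = rng.standard_normal((1, d)) / np.sqrt(d)  # scalar -> d embedding
    W_q = rng.standard_normal((d, d)) / np.sqrt(d)
    W_k = rng.standard_normal((d, d)) / np.sqrt(d)
    W_v = rng.standard_normal((d, d)) / np.sqrt(d)
    out = np.empty((C, T, d))
    for c in range(C):
        x = acts[:, c:c + 1] @ W_e            # (T, d): embed one concept's series
        q, k, v = x @ W_q, x @ W_k, x @ W_v
        attn = softmax(q @ k.T / np.sqrt(d))  # (T, T): attention over time only
        out[c] = attn @ v                     # no mixing across concepts
    return out
```

Because attention is computed within each concept's own time series, the resulting (T, T) maps remain attributable to a single named concept, which is what keeps the temporal explanation interpretable.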