Existing visual question answering methods often suffer from cross-modal spurious correlations and from oversimplified event-level reasoning that fails to capture the temporality, causality, and dynamics of events spanning the video. To address event-level visual question answering, we propose a framework for cross-modal causal relational reasoning. In particular, we introduce a set of causal intervention operations to discover the underlying causal structures across the visual and linguistic modalities. Our framework, named Cross-Modal Causal RelatIonal Reasoning (CMCIR), comprises three modules: i) a Causality-aware Visual-Linguistic Reasoning (CVLR) module that collaboratively disentangles visual and linguistic spurious correlations via front-door and back-door causal interventions; ii) a Spatial-Temporal Transformer (STT) module that captures fine-grained interactions between visual and linguistic semantics; and iii) a Visual-Linguistic Feature Fusion (VLFF) module that adaptively learns global semantic-aware visual-linguistic representations. Extensive experiments on four event-level datasets demonstrate the superiority of CMCIR in discovering visual-linguistic causal structures and achieving robust event-level visual question answering. The datasets, code, and models are available at https://github.com/HCPLab-SYSU/CMCIR.
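As a reading aid for the back-door intervention mentioned above, the following is a minimal sketch of the generic back-door adjustment on a toy discrete model. It is not the CMCIR implementation, and it does not cover the front-door case. The variables Z (confounder), X (treatment), and Y (outcome), the function names, and all probability values are hypothetical and chosen only for illustration.

```python
# Toy illustration of back-door adjustment (NOT the CMCIR implementation).
# Z is a binary confounder, X a binary treatment, Y a binary outcome.
# All probabilities below are made-up numbers.

p_z = {0: 0.5, 1: 0.5}               # P(Z = z)
p_x1_given_z = {0: 0.2, 1: 0.8}      # P(X = 1 | Z = z)
p_y1_given_xz = {                    # P(Y = 1 | X = x, Z = z), keyed by (x, z)
    (0, 0): 0.1, (1, 0): 0.3,
    (0, 1): 0.6, (1, 1): 0.8,
}


def observational(x: int) -> float:
    """P(Y=1 | X=x): each stratum of Z is weighted by P(Z=z | X=x)."""
    p_x_given_z = {
        z: p_x1_given_z[z] if x == 1 else 1.0 - p_x1_given_z[z] for z in p_z
    }
    p_x = sum(p_x_given_z[z] * p_z[z] for z in p_z)  # P(X = x)
    return sum(
        p_y1_given_xz[(x, z)] * p_x_given_z[z] * p_z[z] / p_x for z in p_z
    )


def backdoor(x: int) -> float:
    """P(Y=1 | do(X=x)): each stratum of Z is weighted by the marginal P(Z=z)."""
    return sum(p_y1_given_xz[(x, z)] * p_z[z] for z in p_z)
```

With these made-up numbers, `observational(1) - observational(0)` is about 0.50, whereas `backdoor(1) - backdoor(0)` is about 0.20. The gap between the two is the spurious part of the correlation that the confounder Z induces and that the adjustment removes.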