Existing methods for video question answering (VideoQA) often suffer from spurious correlations between modalities, which cause them to miss the dominant visual evidence and the intent of the question. Moreover, these methods function as black boxes, making it difficult to interpret the visual scene during the QA process. To discover the critical video segments and frames that serve as the visual causal scene for generating reliable answers, we present a causal analysis of VideoQA and propose a cross-modal causal relational reasoning framework named Visual Causal Scene Refinement (VCSR). In particular, we introduce a set of causal front-door intervention operations to explicitly find the visual causal scenes at both the segment and frame levels. VCSR comprises two essential modules: i) the Question-Guided Refiner (QGR) module, which refines consecutive video frames under the guidance of the question semantics to obtain more representative segment features for causal front-door intervention; ii) the Causal Scene Separator (CSS) module, which discovers a collection of visual causal and non-causal scenes based on visual-linguistic causal relevance and estimates the causal effect of the scene-separating intervention in a contrastive learning manner. Extensive experiments on the NExT-QA, Causal-VidQA, and MSRVTT-QA datasets demonstrate the superiority of VCSR in discovering visual causal scenes and achieving robust video question answering. The code is available at https://github.com/YangLiu9208/VCSR.
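The scene-separation idea attributed to the CSS module can be illustrated with a small toy sketch. This is not the VCSR implementation (see the repository linked above for that). It assumes, purely for illustration, that relevance is measured by cosine similarity between each segment feature and a question feature, that the top-k segments are treated as the causal scene, and that an InfoNCE-style loss contrasts the pooled causal scene against the pooled non-causal scene. The function names `separate_scenes` and `contrastive_loss`, the temperature value, and the two-dimensional features are all hypothetical, and the sketch does not model the front-door intervention itself.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_pool(vectors):
    """Average a non-empty list of vectors element-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def separate_scenes(segments, question, k):
    """Assumed rule: the k segments most similar to the question form the
    'causal' scene; the remaining segments form the 'non-causal' scene."""
    scores = [cosine(seg, question) for seg in segments]
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k]), sorted(order[k:])


def contrastive_loss(question, positive, negative, temperature=0.1):
    """InfoNCE-style loss with one positive and one negative (an assumption,
    not the paper's objective). Small when the positive is closer to the question."""
    pos = math.exp(cosine(question, positive) / temperature)
    neg = math.exp(cosine(question, negative) / temperature)
    return -math.log(pos / (pos + neg))


# Toy 2-D features: segments 0 and 1 align with the question, 2 and 3 do not.
segments = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
question = [1.0, 0.0]

# k must satisfy 0 < k < len(segments) so that both scenes are non-empty.
causal, non_causal = separate_scenes(segments, question, k=2)
causal_feat = mean_pool([segments[i] for i in causal])
non_causal_feat = mean_pool([segments[i] for i in non_causal])

loss = contrastive_loss(question, causal_feat, non_causal_feat)
swapped = contrastive_loss(question, non_causal_feat, causal_feat)
print(causal, non_causal, loss < swapped)
```

In this toy setting the loss is lower when the question-relevant segments are used as the positive scene than when the roles are swapped, which is the behavior a contrastive objective of this kind is meant to encourage.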