Revealing the Illusion of Joint Multimodal Understanding in VideoQA Models

While VideoQA Transformer models demonstrate competitive performance on standard benchmarks, the reasons behind their success remain unclear. Do these models jointly capture and leverage the rich multimodal structures and dynamics of video and text, or are they merely exploiting shortcuts to achieve high scores? We investigate this with $\textit{QUAG}$ (QUadrant AveraGe), a lightweight, non-parametric probe that systematically ablates a model's coupled multimodal understanding during inference. Surprisingly, QUAG reveals that the models maintain high performance even when multimodal sub-optimality is injected. Moreover, the models maintain high performance even after self-attention in the multimodal fusion blocks is replaced with "QUAG-attention", a simplistic and less expressive variant of self-attention. This means that current VideoQA benchmarks and their metrics do not penalize shortcuts that discount joint multimodal understanding. Motivated by this, we propose the $\textit{CLAVI}$ (Counterfactual in LAnguage and VIdeo) benchmark, a diagnostic dataset for benchmarking coupled multimodal understanding in VideoQA through counterfactuals. CLAVI consists of temporal questions and videos that are augmented to curate balanced counterfactuals in the language and video domains. Hence, it both incentivizes and measures the reliability of learnt multimodal representations. We evaluate models on CLAVI and find that they achieve high performance on multimodal shortcut instances but perform very poorly on the counterfactuals. We therefore position CLAVI as a litmus test to identify, diagnose, and improve the sub-optimality of learnt multimodal VideoQA representations, which current benchmarks are unable to assess.
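To make the idea of a quadrant-level probe concrete, here is a minimal sketch. The abstract does not spell out the exact operation, so the following is an assumption based only on the name "QUadrant AveraGe": the attention matrix of a fusion layer is split into four modality quadrants (video-video, video-text, text-video, text-text), and within each selected quadrant every row's scores are replaced by that row's mean over the quadrant. The function name `quadrant_average`, the `n_video` parameter, and the quadrant labels are hypothetical and not taken from the paper.

```python
def quadrant_average(attn, n_video, quadrants):
    """Return a copy of `attn` with the selected quadrants row-averaged.

    attn      : square attention matrix as a list of rows; the first
                `n_video` tokens are assumed to be video tokens and the
                rest text tokens (an assumption of this sketch).
    quadrants : iterable of labels from {"VV", "VT", "TV", "TT"}, read
                as <query modality><key modality>.
    """
    n = len(attn)
    spans = {"V": (0, n_video), "T": (n_video, n)}
    out = [list(row) for row in attn]  # leave the input untouched
    for name in quadrants:
        r0, r1 = spans[name[0]]  # rows: query tokens of this modality
        c0, c1 = spans[name[1]]  # columns: key tokens of this modality
        if c1 == c0:
            continue  # empty modality, nothing to average
        for i in range(r0, r1):
            mean = sum(attn[i][c0:c1]) / (c1 - c0)
            for j in range(c0, c1):
                out[i][j] = mean
    return out


# Toy example: two video tokens followed by one text token.
attn = [
    [0.5, 0.25, 0.25],
    [0.25, 0.5, 0.25],
    [0.75, 0.25, 0.0],
]
# Ablate how the text token attends to the individual video tokens.
ablated = quadrant_average(attn, n_video=2, quadrants={"TV"})
```

Because each row is averaged only within a quadrant, the row sums are preserved, so the result is still a valid attention distribution; what is lost is the token-level structure inside the ablated quadrant. If a model's accuracy barely changes under such an ablation, that is consistent with the abstract's claim that it is not relying on the corresponding cross-modal or unimodal interactions.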