Revealing the Illusion of Joint Multimodal Understanding in VideoQA Models

While VideoQA Transformer models demonstrate competitive performance on standard benchmarks, the reasons behind their success are not fully understood. Do these models jointly capture and leverage the rich multimodal structures and dynamics of video and text, or are they merely exploiting shortcuts to achieve high scores? To answer this, we design $\textit{QUAG}$ (QUadrant AveraGe), a lightweight, non-parametric probe for critically analyzing multimodal representations. QUAG facilitates a combined dataset-model study by systematically ablating a model's coupled multimodal understanding during inference. Surprisingly, it shows that the models maintain high performance even under multimodal impairment. We extend QUAG to design "QUAG-attention", a simplistic and less expressive replacement for self-attention. We find that models with QUAG-attention achieve similar performance with significantly fewer multiplication operations (mulops), without any finetuning. These findings indicate that current VideoQA benchmarks and metrics do not penalize models that find shortcuts and discount joint multimodal understanding. Motivated by this, we propose $\textit{CLAVI}$ (Counterfactual in LAnguage and VIdeo), a diagnostic dataset for coupled multimodal understanding in VideoQA. CLAVI consists of temporal questions and videos that are augmented to curate balanced counterfactuals in the language and video domains. We evaluate models on CLAVI and find that all models achieve high performance on multimodal shortcut instances, but most perform poorly on the counterfactual instances that necessitate joint multimodal understanding. Overall, with the multimodal representation analysis using QUAG and the diagnostic analysis using CLAVI, we show that many VideoQA models are incapable of learning multimodal representations and that their success on standard datasets is an illusion of joint multimodal understanding.
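The abstract names the probe but does not spell out the operation. As a rough illustration only, the sketch below assumes that a fused attention matrix is split into four quadrants by modality (video-video, video-text, text-video, text-text), with video tokens ordered first, and that an "impaired" quadrant has each row's entries replaced by that row's mean within the quadrant. The function name `quag`, the quadrant labels, and the token ordering are hypothetical choices for this sketch, not the authors' implementation.

```python
def quag(attn, n_video, quadrants):
    """Average attention row-wise inside the selected quadrants.

    attn      : square attention matrix as a list of lists; each row is
                assumed to sum to 1 (post-softmax).
    n_video   : number of video tokens; they are assumed to come first,
                followed by the text tokens.
    quadrants : iterable of labels from {"VV", "VT", "TV", "TT"}; the first
                letter is the query modality, the second the key modality.
    """
    n = len(attn)
    spans = {"V": range(0, n_video), "T": range(n_video, n)}
    out = [row[:] for row in attn]  # copy so the input stays untouched
    for label in quadrants:
        rows, cols = spans[label[0]], spans[label[1]]
        if len(cols) == 0:
            continue
        for i in rows:
            # Replace every entry of this row segment with the segment mean.
            # The segment sum is unchanged, so each row still sums to 1.
            mean = sum(attn[i][j] for j in cols) / len(cols)
            for j in cols:
                out[i][j] = mean
    return out


if __name__ == "__main__":
    # Two video tokens followed by one text token.
    attn = [
        [0.1, 0.3, 0.6],
        [0.5, 0.25, 0.25],
        [0.2, 0.2, 0.6],
    ]
    # Impair only the video-video quadrant.
    for row in quag(attn, n_video=2, quadrants=["VV"]):
        print(row)
```

Under these assumptions, averaging removes the token-specific structure inside a quadrant while leaving the total attention mass each row assigns to that quadrant intact, which is one way to read "ablating coupled multimodal understanding during inference" without retraining.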
