While VideoQA Transformer models demonstrate competitive performance on standard benchmarks, the reasons behind their success are not fully understood. Do these models jointly capture the rich multimodal structure and dynamics of video and text? Or do they achieve high scores by exploiting biases and spurious features? To provide insight, we design $\textit{QUAG}$ (QUadrant AveraGe), a lightweight, non-parametric probe that conducts combined dataset-model representation analysis by impairing modality fusion. We find that the models achieve high performance on many datasets without leveraging multimodal representations. To further validate QUAG, we design $\textit{QUAG-attention}$, a less-expressive replacement for self-attention with restricted token interactions. Without any finetuning, models equipped with QUAG-attention achieve comparable performance using significantly fewer multiplication operations. These findings raise doubts about current models' ability to learn highly-coupled multimodal representations. We therefore design $\textit{CLAVI}$ (Complements in LAnguage and VIdeo), a stress-test dataset curated by augmenting real-world videos to have high modality coupling. Consistent with the QUAG findings, most models achieve near-trivial performance on CLAVI. This reasserts the limitations of current models in learning highly-coupled multimodal representations, which current datasets do not evaluate (project page: https://dissect-videoqa.github.io).
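The quadrant-averaging idea behind QUAG can be sketched as follows. This is a minimal illustration inferred only from the description above, not the authors' implementation: the function name, the choice of the text-to-video quadrant, and the plain-list representation of a post-softmax attention map over concatenated [video; text] tokens are all our own assumptions.

```python
def quadrant_average(attn, n_video):
    """Impair one cross-modal interaction in a token-token attention map.

    attn: list of rows (post-softmax attention) over concatenated
          [video; text] tokens; n_video: number of video tokens.
    Replaces every value in the text->video quadrant (text-query rows,
    video-key columns) with that row's quadrant average, so each row's
    total attention mass is preserved while fine-grained cross-modal
    structure is destroyed.
    """
    out = [row[:] for row in attn]          # copy; video-query rows stay intact
    for i in range(n_video, len(out)):      # text-query rows only
        avg = sum(out[i][:n_video]) / n_video
        out[i][:n_video] = [avg] * n_video  # flatten the quadrant row-wise
    return out
```

If a model's answers are unchanged after such averaging, its predictions cannot be relying on fine-grained fusion within that quadrant, which is the kind of dataset-model evidence the probe is designed to surface.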