Large Language Models (LLMs) have shown remarkable performance on a wide range of natural language understanding and generation tasks. We observe that LLMs provide effective priors in exploiting $\textit{linguistic shortcuts}$ for temporal and causal reasoning in Video Question Answering (VideoQA). However, such priors often cause suboptimal results on VideoQA by leading the model to over-rely on questions, $\textit{i.e.}$, $\textit{linguistic bias}$, while ignoring visual content. The resulting errors are also known as `ungrounded guesses' or `hallucinations'. To address this problem while leveraging LLMs' priors on VideoQA, we propose a novel framework, Flipped-VQA, which encourages the model to predict all combinations of the $\langle$V, Q, A$\rangle$ triplet by flipping the source pair and the target label to understand their complex relationships, $\textit{i.e.}$, to predict A, Q, and V given VQ, VA, and QA pairs, respectively. In this paper, we develop LLaMA-VQA by applying Flipped-VQA to LLaMA, and it outperforms both LLM-based and non-LLM-based models on five challenging VideoQA benchmarks. Furthermore, Flipped-VQA is a general framework that is applicable to various LLMs (OPT and GPT-J) and consistently improves their performance. We empirically demonstrate that Flipped-VQA not only enhances the exploitation of linguistic shortcuts but also mitigates linguistic bias, which causes incorrect answers through over-reliance on the question. Code is available at https://github.com/mlvlab/Flipped-VQA.
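The flipping of source pairs and target labels described in the abstract can be illustrated with a minimal sketch. It is not the released implementation: the names `Triplet`, `FlippedTask`, `flipped_tasks`, and `total_loss` are hypothetical, the video is stood in for by a string identifier rather than visual features, and the unweighted sum of the three per-task losses is an assumption, since the abstract does not say how the objectives are combined.

```python
from typing import Dict, List, NamedTuple, Tuple


class Triplet(NamedTuple):
    """One <V, Q, A> sample. `video` is a placeholder ID, not real features."""
    video: str
    question: str
    answer: str


class FlippedTask(NamedTuple):
    """A source pair and the element the model is asked to predict."""
    name: str
    source: Tuple[str, str]
    target: str


def flipped_tasks(t: Triplet) -> List[FlippedTask]:
    """Build the three prediction tasks by flipping source pair and target."""
    return [
        FlippedTask("VQ->A", (t.video, t.question), t.answer),    # predict A from V, Q
        FlippedTask("VA->Q", (t.video, t.answer), t.question),    # predict Q from V, A
        FlippedTask("QA->V", (t.question, t.answer), t.video),    # predict V from Q, A
    ]


def total_loss(per_task_loss: Dict[str, float]) -> float:
    """Assumption: combine the three task losses as a plain unweighted sum."""
    return sum(per_task_loss[name] for name in ("VQ->A", "VA->Q", "QA->V"))


if __name__ == "__main__":
    sample = Triplet("vid_001", "What happens after the man sits down?", "He opens a book")
    for task in flipped_tasks(sample):
        print(task.name, task.source, "=>", task.target)
```

In this sketch, every triplet yields three training signals instead of one. The VA->Q and QA->V tasks are the ones that cannot be solved from the question alone, which is the intuition behind reducing over-reliance on the question.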