Large Language Models (LLMs) have shown remarkable performance on a wide range of natural language understanding and generation tasks. We observe that LLMs provide effective priors for exploiting $\textit{linguistic shortcuts}$ in temporal and causal reasoning for Video Question Answering (VideoQA). However, such priors often cause suboptimal results on VideoQA by leading the model to over-rely on the question, $\textit{i.e.}$, $\textit{linguistic bias}$, while ignoring the visual content. This is also known as `ungrounded guesses' or `hallucinations'. To address this problem while leveraging LLMs' priors on VideoQA, we propose a novel framework, Flipped-VQA, which encourages the model to predict all combinations of the $\langle$V, Q, A$\rangle$ triplet by flipping the source pair and the target label so that it learns their complex relationships, $\textit{i.e.}$, to predict A, Q, and V given the VQ, VA, and QA pairs, respectively. In this paper, we develop LLaMA-VQA by applying Flipped-VQA to LLaMA, and it outperforms both LLM-based and non-LLM-based models on five challenging VideoQA benchmarks. Furthermore, Flipped-VQA is a general framework that is applicable to various LLMs (OPT and GPT-J) and consistently improves their performance. We empirically demonstrate that Flipped-VQA not only enhances the exploitation of linguistic shortcuts but also mitigates linguistic bias, which causes incorrect answers through over-reliance on the question. Code is available at https://github.com/mlvlab/Flipped-VQA.
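The flipping of source pair and target label described in the abstract can be illustrated with a minimal sketch. It only enumerates the three (source pair, target) combinations of a $\langle$V, Q, A$\rangle$ triplet. The names `Triplet` and `flipped_tasks` are hypothetical, the video is represented by a placeholder string rather than real features, and nothing here is taken from the released code; how each target is scored during training is not covered.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Triplet:
    """One VideoQA sample (hypothetical container, for illustration only)."""
    video: str      # placeholder standing in for video features
    question: str
    answer: str


def flipped_tasks(sample: Triplet) -> Dict[str, Tuple[Tuple[str, str], str]]:
    """Return the three (source pair, target) combinations of a triplet.

    Each entry maps a task name to ((source_1, source_2), target):
    predict A from VQ, Q from VA, and V from QA.
    """
    return {
        "VQ->A": ((sample.video, sample.question), sample.answer),
        "VA->Q": ((sample.video, sample.answer), sample.question),
        "QA->V": ((sample.question, sample.answer), sample.video),
    }


if __name__ == "__main__":
    sample = Triplet(
        video="<video>",
        question="Why did the man fall?",
        answer="He slipped.",
    )
    for name, (source, target) in flipped_tasks(sample).items():
        print(name, source, "=>", target)
```

The first entry is the standard VideoQA task; the other two are the flipped tasks, in which the question and the video each take a turn as the prediction target.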