Video Question Answering (VideoQA) aims to answer natural language questions based on the information observed in videos. Despite the recent success of Large Multimodal Models (LMMs) in image-language understanding and reasoning, they handle VideoQA insufficiently: they simply take uniformly sampled frames as visual inputs, ignoring question-relevant visual clues. Moreover, existing VideoQA datasets provide no human annotations for question-critical timestamps. In light of this, we propose a novel weakly supervised framework that enforces LMMs to reason out answers with question-critical moments as visual inputs. Specifically, we fuse question and answer pairs into event descriptions to locate multiple keyframes as target moments, which serve as pseudo-labels. With these pseudo-labels as additional weak supervision, we devise a lightweight Gaussian-based Contrastive Grounding (GCG) module. GCG learns multiple Gaussian functions to characterize the temporal structure of the video and samples question-critical frames as positive moments, which serve as the visual inputs of LMMs. Extensive experiments on several VideoQA benchmarks verify the effectiveness of our framework, and we achieve substantial improvements over previous state-of-the-art methods.
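To make the grounding idea concrete, below is a minimal sketch of what a Gaussian-based keyframe selector could look like. It is an illustration under our own assumptions, not the paper's GCG implementation: all module names, shapes, and the `top_k` selection heuristic are hypothetical. The sketch predicts centers and widths of several Gaussians over the normalized timeline from the fused question and video representation, sums them into a per-frame relevance curve, and picks the highest-weighted frames as positive moments for the LMM.

```python
import torch
import torch.nn as nn

class GaussianKeyframeSelector(nn.Module):
    """Hypothetical sketch of a Gaussian-based grounding module.

    Given per-frame features and a question embedding, it predicts the
    centers and widths of K Gaussians over the normalized timeline,
    sums them into a per-frame relevance curve, and selects the top-k
    frames as "positive moments". Names and shapes are assumptions,
    not the paper's actual GCG implementation.
    """

    def __init__(self, dim: int, num_gaussians: int = 4):
        super().__init__()
        self.num_gaussians = num_gaussians
        # Predict (center, width) for each Gaussian from the fused
        # pooled-video + question representation.
        self.predictor = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, 2 * num_gaussians),
        )

    def forward(self, frame_feats: torch.Tensor, question_feat: torch.Tensor, top_k: int = 8):
        # frame_feats: (T, dim) per-frame features; question_feat: (dim,)
        T = frame_feats.size(0)
        fused = torch.cat([frame_feats.mean(dim=0), question_feat], dim=-1)
        params = self.predictor(fused)                              # (2K,)
        centers = torch.sigmoid(params[: self.num_gaussians])       # in [0, 1]
        widths = torch.sigmoid(params[self.num_gaussians:]) * 0.5 + 1e-2
        t = torch.linspace(0, 1, T, device=frame_feats.device)      # normalized timeline
        # Mixture of K Gaussians -> per-frame relevance weights, shape (T,)
        weights = torch.exp(
            -0.5 * ((t[:, None] - centers[None, :]) / widths[None, :]) ** 2
        ).sum(dim=-1)
        weights = weights / weights.max()
        # Keep the top-k frames (in temporal order) as question-critical moments.
        idx = weights.topk(min(top_k, T)).indices.sort().values
        return idx, weights
```

In a full pipeline, the returned `weights` could be trained contrastively against the pseudo-labeled target moments, while the selected frames `frame_feats[idx]` would replace uniform sampling as the LMM's visual input.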