Video Question Answering (VideoQA) aims to answer natural language questions based on the information observed in videos. Despite the recent success of Large Multimodal Models (LMMs) in image-language understanding and reasoning, they handle VideoQA poorly because they simply take uniformly sampled frames as visual inputs, which ignores question-relevant visual clues. Moreover, existing VideoQA datasets provide no human annotations for question-critical timestamps. In light of this, we propose a novel weakly supervised framework that forces LMMs to reason out the answers with question-critical moments as visual inputs. Specifically, we fuse the question and answer pairs into event descriptions and use them to find multiple keyframes as target moments, which serve as pseudo-labels. With these pseudo-labels as additional weak supervision, we devise a lightweight Gaussian-based Contrastive Grounding (GCG) module. GCG learns multiple Gaussian functions to characterize the temporal structure of the video, and samples question-critical frames as positive moments to be the visual inputs of LMMs. Extensive experiments on several VideoQA benchmarks verify the effectiveness of our framework, and we achieve substantial improvements compared to previous state-of-the-art methods.
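To make the idea of Gaussian-weighted frame selection concrete, the following is a minimal sketch and not the authors' implementation. It assumes the Gaussian centers and widths are given by hand, whereas GCG learns them under weak supervision with a contrastive objective. The function names `gaussian_frame_weights` and `select_top_frames` are hypothetical, and picking the top-k frames by weight is an assumed stand-in for the sampling step described in the abstract.

```python
import math


def gaussian_frame_weights(num_frames, centers, widths):
    """Score each frame with a sum of Gaussians over normalized time in [0, 1].

    centers and widths are hand-picked here (an assumption for illustration);
    in the paper they would be predicted by the learned GCG module.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be positive")
    if len(centers) != len(widths):
        raise ValueError("centers and widths must have the same length")

    weights = []
    for i in range(num_frames):
        t = (i + 0.5) / num_frames  # midpoint of frame i on the [0, 1] timeline
        score = sum(
            math.exp(-((t - mu) ** 2) / (2.0 * sigma ** 2))
            for mu, sigma in zip(centers, widths)
        )
        weights.append(score)

    total = sum(weights)
    return [w / total for w in weights]  # normalize so the weights sum to 1


def select_top_frames(weights, k):
    """Return the indices of the k highest-weighted frames in temporal order."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:k])


# Two hypothetical question-critical moments, at 25% and 75% of the video.
weights = gaussian_frame_weights(32, centers=[0.25, 0.75], widths=[0.05, 0.05])
frames = select_top_frames(weights, k=4)
print(frames)
```

With uniform sampling, the four frames would be spread evenly across the video regardless of the question. Here the selected frames cluster around the two assumed moments, which is the behavior the framework aims to obtain from learned Gaussians.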