Video Question Answering (VideoQA) aims to answer natural language questions based on the information observed in videos. Despite the recent success of Large Multimodal Models (LMMs) in image-language understanding and reasoning, they handle VideoQA inadequately: they simply take uniformly sampled frames as visual inputs, which ignores question-relevant visual clues. Moreover, existing VideoQA datasets provide no human annotations for question-critical timestamps. In light of this, we propose a novel weakly supervised framework that forces LMMs to reason out the answers with question-critical moments as visual inputs. Specifically, we first fuse each question-answer pair into an event description and use the visual-language alignment capability of CLIP models to find multiple keyframes, which serve as target moments and pseudo-labels. With these pseudo-labeled keyframes as additional weak supervision, we devise a lightweight Gaussian-based Contrastive Grounding (GCG) module. GCG learns multiple Gaussian functions to characterize the temporal structure of the video and samples question-critical frames as positive moments to serve as the visual inputs of LMMs. Extensive experiments on several benchmarks verify the effectiveness of our framework, and we achieve substantial improvements over previous state-of-the-art methods.
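To make the two selection steps described above more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that frame and text embeddings are already available (for example from a CLIP encoder, which is not called here), and that the Gaussian centers and widths are given rather than learned. The function names, the max-over-Gaussians weighting, and the top-k selection rule are all illustrative assumptions. The contrastive training objective is omitted.

```python
import numpy as np


def pseudo_label_keyframes(frame_feats, text_feat, k):
    """Return indices of the k frames most similar to the text embedding.

    frame_feats: (num_frames, dim) array of frame embeddings (assumed given).
    text_feat:   (dim,) embedding of the fused question-answer description.
    """
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    t = text_feat / np.linalg.norm(text_feat)
    sims = f @ t  # cosine similarity per frame
    return np.sort(np.argsort(-sims)[:k])


def gaussian_frame_weights(num_frames, centers, widths):
    """Weight each frame by the largest of several Gaussians over time.

    centers and widths are in normalized time [0, 1]; here they are fixed
    inputs, whereas a trained module would predict them (assumption).
    """
    pos = np.linspace(0.0, 1.0, num_frames)            # frame positions
    c = np.asarray(centers, dtype=float)[:, None]      # (num_gauss, 1)
    w = np.asarray(widths, dtype=float)[:, None]       # (num_gauss, 1)
    g = np.exp(-0.5 * ((pos[None, :] - c) / w) ** 2)   # (num_gauss, num_frames)
    return g.max(axis=0)


def select_frames(weights, k):
    """Pick the k highest-weighted frames, returned in temporal order."""
    return np.sort(np.argsort(-weights)[:k])


if __name__ == "__main__":
    # Toy embeddings: four orthogonal frames and a text vector close to frames 1 and 3.
    frames = np.eye(4)
    text = np.array([0.1, 0.9, 0.0, 0.8])
    print(pseudo_label_keyframes(frames, text, k=2))

    # Two Gaussians centered at 20% and 80% of an 11-frame video.
    weights = gaussian_frame_weights(11, centers=[0.2, 0.8], widths=[0.05, 0.05])
    print(select_frames(weights, k=2))
```

In this sketch the pseudo-labeled keyframes and the Gaussian-selected frames are computed independently; in a weakly supervised setup the former would serve as targets that pull the Gaussian parameters toward question-critical moments.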