Video question--answering is a fundamental task in video understanding. Although current vision--language models (VLMs) equipped with Video Transformers enable temporal modeling and yield superior results, they do so at a huge computational cost and are thus too expensive to deploy in real-time application scenarios. An economical workaround samples only a small portion of frames to represent the main content of a video and tunes an image--text model on these sampled frames. Recent video understanding models usually sample a set of frames or clips at random, regardless of the internal correlations between their visual contents or their relevance to the question. We argue that such aimless sampling may omit the key frames from which the correct answer can be deduced, and that the situation worsens as sampling sparsity increases, which typically happens as videos get longer. To mitigate this issue, we propose two frame sampling strategies, namely the most domain frames (MDF) and most implied frames (MIF), to maximally preserve the frames that are most likely vital to the given questions. MDF passively minimizes the risk of key frame omission in a bootstrap manner, while MIF actively searches for key frames customized for each video--question pair with the assistance of auxiliary models. Experimental results on three public datasets with three advanced VLMs (CLIP, GIT and All-in-one) demonstrate that the proposed strategies can boost the performance of image--text pretrained models. The source code for the proposed method is publicly available at https://github.com/declare-lab/sas-vqa.
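To make the contrast between aimless and question-aware sampling concrete, the following is a minimal sketch and not the released MDF or MIF implementation. It assumes that each frame and the question have already been embedded into a shared vector space by some auxiliary model. The function names `random_sample` and `question_aware_sample` and the toy two-dimensional embeddings are hypothetical and chosen only for illustration.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def random_sample(num_frames, k, seed=0):
    """Baseline: pick k frame indices uniformly at random, ignoring content."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_frames), k))


def question_aware_sample(frame_embs, question_emb, k):
    """Pick the k frames whose (assumed precomputed) embeddings best match the question."""
    scores = [cosine(f, question_emb) for f in frame_embs]
    ranked = sorted(range(len(frame_embs)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # return indices in temporal order


# Toy video with 8 frames: only frames 5 and 6 point toward the question.
frames = [[1.0, 0.0]] * 5 + [[0.1, 1.0], [0.0, 1.0], [1.0, 0.1]]
question = [0.0, 1.0]

selected = question_aware_sample(frames, question, k=2)
baseline = random_sample(len(frames), k=2)
print("question-aware:", selected)
print("random:", baseline)
```

In this toy setting the question-aware selection returns the two relevant frames, whereas the random baseline has no particular reason to include them. The gap would be expected to widen as the video gets longer while the frame budget `k` stays fixed.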