Self-Chained Image-Language Model for Video Localization and Question Answering

Recent studies have shown promising results on utilizing large pre-trained image-language models for video question answering. While these image-language models can efficiently bootstrap the representation learning of video-language models, they typically concatenate uniformly sampled video frames as visual inputs without explicit language-aware temporal modeling. When only a portion of a video input is relevant to the language query, such uniform frame sampling can often miss important visual cues. Although humans often find a video moment to focus on and rewind it to answer questions, training a query-aware video moment localizer often requires expensive annotations and high computational costs. To address this issue, we propose Self-Chained Video Localization-Answering (SeViLA), a novel framework that leverages a single image-language model (BLIP-2) to tackle both temporal keyframe localization and QA on videos. The SeViLA framework consists of two modules, a Localizer and an Answerer, both of which are parameter-efficiently fine-tuned from BLIP-2. We propose two ways of chaining these modules for cascaded inference and self-refinement. First, in the forward chain, the Localizer finds multiple language-aware keyframes in a video, which the Answerer uses to predict the answer. Second, in the reverse chain, the Answerer generates keyframe pseudo-labels to refine the Localizer, alleviating the need for expensive video moment localization annotations. Our SeViLA framework outperforms several strong baselines on 5 challenging video QA and event prediction benchmarks, and achieves state-of-the-art results in both fine-tuning (NExT-QA, STAR) and zero-shot (NExT-QA, STAR, How2QA, VLEP) settings. We also analyze the impact of the Localizer, compare the Localizer with other temporal localization models, study pre-training and self-refinement of the Localizer, and vary the number of keyframes.
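The two chains can be illustrated with a minimal Python sketch. It is not the SeViLA implementation: the function names (`forward_chain`, `reverse_chain_pseudo_labels`), the toy string "frames", and the stand-in `toy_score` and `toy_answer` callables are all hypothetical and take the place of the BLIP-2 based Localizer and Answerer. Two details are assumptions not stated in the abstract: that keyframes are chosen as the top-k frames by Localizer score, and that a frame receives a positive pseudo-label when the Answerer, given that frame alone, produces the correct answer.

```python
from typing import Callable, List, Sequence

Frame = str  # stand-in for an image tensor; real frames would be pixel arrays

ScoreFn = Callable[[Frame, str], float]            # hypothetical Localizer interface
AnswerFn = Callable[[Sequence[Frame], str], str]   # hypothetical Answerer interface


def forward_chain(frames: Sequence[Frame], question: str,
                  score_frame: ScoreFn, answer: AnswerFn, k: int = 4) -> str:
    """Localizer picks k language-aware keyframes; Answerer predicts from them."""
    scores = [score_frame(f, question) for f in frames]
    # Assumption: keyframes are the k highest-scoring frames.
    top = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)[:k]
    # Assumption: selected frames are passed on in their original temporal order.
    keyframes = [frames[i] for i in sorted(top)]
    return answer(keyframes, question)


def reverse_chain_pseudo_labels(frames: Sequence[Frame], question: str,
                                gold_answer: str, answer: AnswerFn) -> List[int]:
    """Answerer labels each frame; the labels can then supervise the Localizer."""
    # Assumption: a frame is a keyframe (label 1) if the Answerer gets the
    # question right when shown only that frame.
    return [int(answer([f], question) == gold_answer) for f in frames]


# Toy stand-ins so the sketch runs without any model.
def toy_score(frame: Frame, question: str) -> float:
    return 1.0 if "dog" in frame else 0.0


def toy_answer(frames: Sequence[Frame], question: str) -> str:
    return "ball" if any("ball" in f for f in frames) else "unknown"


video = ["sky", "dog runs", "dog catches ball", "grass"]
question = "What does the dog catch?"

prediction = forward_chain(video, question, toy_score, toy_answer, k=2)
labels = reverse_chain_pseudo_labels(video, question, "ball", toy_answer)
```

In this toy run the forward chain forwards only the two dog frames to the Answerer, and the reverse chain marks the single frame that suffices to answer the question. In a real system the labels would serve as training targets for the Localizer rather than being inspected directly.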
