Self-Chained Image-Language Model for Video Localization and Question Answering

Recent studies have shown promising results from using pre-trained image-language models for video question answering. While these image-language models can efficiently bootstrap the representation learning of video-language models, they typically concatenate uniformly sampled video frames as visual inputs without explicit language-aware temporal modeling. When only a portion of a video input is relevant to the language query, such uniform frame sampling often misses important visual cues. Humans often find a moment in a video to focus on and rewind to it when answering questions, but training a query-aware video moment localizer typically requires expensive annotations and high computational costs. To address this issue, we propose Self-Chained Video Localization-Answering (SeViLA), a novel framework that leverages a single image-language model (BLIP-2) to tackle both temporal keyframe localization and QA on videos. The SeViLA framework consists of two modules, a Localizer and an Answerer, both of which are parameter-efficiently fine-tuned from BLIP-2. We chain these modules for cascaded inference and self-refinement. First, in the forward chain, the Localizer finds multiple language-aware keyframes in a video, which the Answerer uses to predict the answer. Second, in the reverse chain, the Answerer generates keyframe pseudo-labels to refine the Localizer, alleviating the need for expensive video moment localization annotations. SeViLA outperforms several strong baselines and previous works on five video QA and event prediction tasks, and achieves state-of-the-art results in both fine-tuning (NExT-QA, STAR) and zero-shot (NExT-QA, STAR, How2QA, VLEP) settings. We also present a comprehensive analysis, including the impact of the Localizer, comparisons of the Localizer with other temporal localization models, pre-training and self-refinement of the Localizer, and the effect of varying the number of keyframes.
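The two chains described above can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not the SeViLA implementation: frames are plain strings rather than images, `toy_score` and `toy_answer` are hypothetical stand-ins for the BLIP-2-based Localizer and Answerer, and the pseudo-labeling rule used here (a frame counts as a keyframe when the Answerer recovers the known answer from that frame alone) is one plausible reading that the abstract itself does not spell out.

```python
from typing import Callable, List, Sequence, Tuple

Frame = str  # assumption: a string stands in for an image frame

def forward_chain(
    frames: Sequence[Frame],
    question: str,
    score_frame: Callable[[Frame, str], float],
    answer: Callable[[List[Frame], str], str],
    k: int = 4,
) -> Tuple[List[int], str]:
    """Localizer scores each frame against the question; the top-k frames,
    restored to temporal order, are passed to the Answerer."""
    ranked = sorted(
        range(len(frames)),
        key=lambda i: score_frame(frames[i], question),
        reverse=True,
    )
    keep = sorted(ranked[:k])  # back to temporal order
    keyframes = [frames[i] for i in keep]
    return keep, answer(keyframes, question)

def reverse_chain(
    frames: Sequence[Frame],
    question: str,
    gold_answer: str,
    answer: Callable[[List[Frame], str], str],
) -> List[int]:
    """Answerer produces per-frame pseudo-labels for refining the Localizer.
    Assumed rule: label 1 if the single frame suffices for the gold answer."""
    return [int(answer([f], question) == gold_answer) for f in frames]

# Hypothetical toy models, used only to make the sketch runnable.
def toy_score(frame: Frame, question: str) -> float:
    return 1.0 if "dog" in frame else 0.0

def toy_answer(frames: List[Frame], question: str) -> str:
    if any("sleeps" in f for f in frames):
        return "sleeps"
    if any("runs" in f for f in frames):
        return "runs"
    return "unknown"

video = ["sky", "dog runs", "dog sleeps", "sky"]
question = "What does the dog do at the end?"
picked, prediction = forward_chain(video, question, toy_score, toy_answer, k=2)
pseudo_labels = reverse_chain(video, question, "sleeps", toy_answer)
```

In this sketch the pseudo-labels come only from question-answer pairs, which mirrors the stated motivation: the Localizer can be refined without any manually annotated video moments.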
