Multimodal large language models (MLLMs) perform strongly on vision-language tasks, yet their ability to process long videos is constrained by input context length and high computational cost. Sparse frame sampling is therefore a necessary preprocessing step, and the quality of the sampled frames directly affects downstream performance. Existing keyframe search algorithms balance efficiency against sampled-frame quality, but they rely heavily on the visual modality alone. As a result, they adapt poorly to text-related tasks, and their retrieval results often drift away from the core semantic content. To address this, we propose Visual-Subtitle Integration (VSI), a multimodal keyframe retrieval framework. VSI uses a dual-branch collaborative retrieval approach that combines Video Search and Subtitle Match, fusing complementary visual and textual information for precise localization. Experiments on LongVideoBench and VideoMME show that VSI achieves state-of-the-art keyframe retrieval accuracy, delivers marked improvements on text-related tasks, and generalizes well to other tasks.
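As a rough illustration of the dual-branch idea, the sketch below fuses per-frame scores from a visual branch with scores from a subtitle branch and keeps the top-k frames. It is not the VSI implementation: the abstract does not specify how the branches are scored or fused, so the word-overlap subtitle score, the min-max normalization, the weighted sum, the weight `alpha`, and every name here (`Subtitle`, `subtitle_scores`, `fuse_and_select`) are assumptions made for illustration. The visual scores are assumed to come from some upstream Video Search step.

```python
from dataclasses import dataclass


@dataclass
class Subtitle:
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str


def subtitle_scores(query, subtitles, frame_times):
    """Score each frame by the fraction of query words found in a subtitle
    that is on screen at that frame's timestamp (a stand-in for Subtitle Match)."""
    query_words = set(query.lower().split())
    scores = []
    for t in frame_times:
        best = 0.0
        for sub in subtitles:
            if sub.start <= t <= sub.end and query_words:
                sub_words = set(sub.text.lower().split())
                best = max(best, len(query_words & sub_words) / len(query_words))
        scores.append(best)
    return scores


def normalize(scores):
    """Min-max normalize to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_and_select(visual, textual, k, alpha=0.5):
    """Weighted-sum fusion (assumed, not from the paper) followed by top-k
    selection. Returns the chosen frame indices in temporal order."""
    if len(visual) != len(textual):
        raise ValueError("both branches must score the same frames")
    v, t = normalize(visual), normalize(textual)
    fused = [alpha * a + (1 - alpha) * b for a, b in zip(v, t)]
    ranked = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
    return sorted(ranked[:k])


# Toy example: four candidate frames sampled at 0 s, 10 s, 20 s, and 30 s.
frame_times = [0.0, 10.0, 20.0, 30.0]
visual = [0.2, 0.9, 0.1, 0.4]  # hypothetical visual-branch scores
subtitles = [
    Subtitle(18.0, 22.0, "the chef adds salt"),
    Subtitle(28.0, 32.0, "then stirs the soup"),
]
textual = subtitle_scores("when does the chef add salt", subtitles, frame_times)

print(fuse_and_select(visual, textual, k=2))             # fused: frames 1 and 2
print(fuse_and_select(visual, textual, k=2, alpha=1.0))  # visual only: frames 1 and 3
```

In this toy case the visual branch alone would pick frames 1 and 3, while the fused ranking swaps frame 3 for frame 2, the frame whose subtitle matches the query. That is the kind of text-dependent correction a subtitle branch is meant to provide.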