Video Large Language Models (Video-LLMs) have made remarkable progress in video understanding tasks. However, they are constrained by the maximum input token length, making it impractical to feed entire videos into the model. Existing frame selection approaches, such as uniform frame sampling and text-frame retrieval, fail to account for variations in information density within a video or for the complex instructions in the tasks, leading to sub-optimal performance. In this paper, we propose Frame-Voyager, which learns to query informative frame combinations based on the textual queries given in the task. To train Frame-Voyager, we introduce a new data collection and labeling pipeline that ranks frame combinations using a pre-trained Video-LLM. Given a video of M frames, we traverse all of its T-frame combinations, feed each into the Video-LLM, and rank the combinations by the Video-LLM's prediction losses. Using this ranking as supervision, we train Frame-Voyager to query the frame combinations with lower losses. In experiments, we evaluate Frame-Voyager on four Video Question Answering benchmarks by plugging it into two different Video-LLMs. The experimental results demonstrate that Frame-Voyager achieves impressive results in all settings, highlighting its potential as a plug-and-play solution for Video-LLMs.
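To make the labeling pipeline concrete, below is a minimal Python sketch of the combination-ranking step described above. It is illustrative only, not the paper's implementation: `frames`, `query`, `answer`, and the `compute_loss` callback (standing in for one forward pass of the pre-trained Video-LLM that returns the answer's prediction loss) are assumed names.

```python
from itertools import combinations

def rank_frame_combinations(frames, query, answer, T, compute_loss):
    """Rank all T-frame combinations of a video by a Video-LLM's
    prediction loss on (query, answer); lower loss ranks higher."""
    ranked = []
    for combo in combinations(range(len(frames)), T):
        selected = [frames[i] for i in combo]
        # compute_loss is a hypothetical stand-in for one forward pass of
        # the pre-trained Video-LLM, returning the answer's prediction loss.
        ranked.append((compute_loss(selected, query, answer), combo))
    # Sort ascending: combinations with lower loss are more informative
    # for this query and serve as higher-ranked supervision targets.
    ranked.sort(key=lambda pair: pair[0])
    return ranked  # [(loss, frame_index_tuple), ...], best first
```

Since the number of T-frame combinations grows as C(M, T), an exhaustive traversal of this kind is only tractable for modest M and T.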