Large vision--language models (VLMs) are increasingly applied to long-video question answering, yet inference is often bottlenecked by the number of input frames and resulting visual tokens. Naive sparse sampling can miss decisive moments, while purely relevance-driven selection frequently collapses onto near-duplicate frames and sacrifices coverage of temporally distant evidence. We propose a question-adaptive greedy frame selection method that jointly optimizes query relevance and semantic representativeness under a fixed frame budget. Our approach constructs a 1~FPS candidate pool (capped at 1000 frames) with exact timestamp alignment, embeds candidates in two complementary spaces (SigLIP for question relevance and DINOv2 for semantic similarity), and selects frames by greedily maximizing a weighted sum of a modular relevance term and a facility-location coverage term. This objective is normalized, monotone, and submodular, so greedy selection attains the standard $(1-1/e)$ approximation guarantee. To account for question-dependent trade-offs between relevance and coverage, we introduce four preset strategies and a lightweight text-only question-type classifier that routes each query to its best-performing preset. Experiments on MLVU show consistent accuracy gains over uniform sampling and a strong recent baseline across frame budgets, with the largest improvements under tight budgets.
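The greedy selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the objective takes the form $F(S) = \lambda \sum_{i \in S} r_i + (1-\lambda) \sum_{j} \max_{i \in S} s_{ji}$, with nonnegative relevance scores $r_i$ (e.g. from SigLIP) and nonnegative pairwise similarities $s_{ji}$ (e.g. clipped DINOv2 cosine similarities). The function name `greedy_select`, the weight `lam`, and the toy inputs are hypothetical.

```python
from typing import List, Sequence


def greedy_select(
    relevance: Sequence[float],
    similarity: Sequence[Sequence[float]],
    budget: int,
    lam: float = 0.5,
) -> List[int]:
    """Greedily pick up to `budget` frame indices, returned in selection order.

    Assumed objective (not taken from the paper's equations):
        F(S) = lam * sum_{i in S} relevance[i]
             + (1 - lam) * sum_j max_{i in S} similarity[j][i]
    Inputs are assumed nonnegative so that F is normalized and monotone.
    """
    n = len(relevance)
    selected: List[int] = []
    # best_cover[j]: highest similarity between candidate j and any selected frame.
    best_cover = [0.0] * n
    remaining = set(range(n))

    for _ in range(min(budget, n)):
        best_idx, best_gain = -1, float("-inf")
        for i in sorted(remaining):  # sorted: ties go to the earliest frame
            # Marginal coverage gain of adding frame i (facility-location term).
            cover_gain = sum(
                max(similarity[j][i] - best_cover[j], 0.0) for j in range(n)
            )
            gain = lam * relevance[i] + (1.0 - lam) * cover_gain
            if gain > best_gain:
                best_idx, best_gain = i, gain
        selected.append(best_idx)
        remaining.remove(best_idx)
        for j in range(n):
            best_cover[j] = max(best_cover[j], similarity[j][best_idx])
    return selected


# Toy pool: frames 0 and 1 are near-duplicates and both highly relevant;
# frame 3 is temporally distant evidence with moderate relevance.
toy_relevance = [0.9, 0.85, 0.1, 0.5]
toy_similarity = [
    [1.0, 0.95, 0.1, 0.0],
    [0.95, 1.0, 0.1, 0.0],
    [0.1, 0.1, 1.0, 0.2],
    [0.0, 0.0, 0.2, 1.0],
]
balanced = greedy_select(toy_relevance, toy_similarity, budget=2, lam=0.5)
relevance_only = greedy_select(toy_relevance, toy_similarity, budget=2, lam=1.0)
```

On this toy pool, the relevance-only setting (`lam=1.0`) picks the two near-duplicate frames, whereas the balanced setting replaces the duplicate with the distant frame. In practice the selected indices would likely be sorted by timestamp before being passed to the VLM. This naive loop costs $O(kn^2)$ for budget $k$ and pool size $n$, which remains manageable for a pool capped at 1000 candidates.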