Large multimodal models (LMMs) have recently demonstrated remarkable performance in video question answering (VideoQA), yet reasoning over video remains challenging because processing many frames is costly at inference time and dilutes the question-relevant information. Keyframe selection offers efficiency and sharper reasoning, but when it relies only on image-text similarity it suffers from sparse supervision and redundant frame choices. We present a question-aware keyframe selection framework with two components: pseudo keyframe labels derived from LMMs, which provide informative supervision, and a coverage regularization term, which promotes diverse, complementary evidence across time. Experiments on NExT-QA show that our method significantly improves accuracy, especially for temporal and causal question types, establishing keyframe selection as an effective and learnable module for VideoQA.