Vision-language models (VLMs) advance video understanding but operate under tight computational budgets, making performance dependent on selecting a small, high-quality subset of frames. Existing frame sampling strategies, such as uniform or fixed-budget selection, fail to adapt to variations in content density or task complexity. To address this, we present FrameOracle, a lightweight, plug-and-play module that predicts both (1) which frames are most relevant to a given query and (2) how many frames are needed. FrameOracle is trained via a curriculum that progresses from weak proxy signals, such as cross-modal similarity, to stronger supervision with FrameOracle-41K, the first large-scale VideoQA dataset with validated keyframe annotations specifying minimal sufficient frames per question. Extensive experiments across five VLMs and six benchmarks show that FrameOracle reduces 16-frame inputs to an average of 10.4 frames without accuracy loss. When starting from 64-frame candidates, it reduces inputs to 13.9 frames on average while improving accuracy by 1.5%, achieving state-of-the-art efficiency-accuracy trade-offs for scalable video understanding.
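To make the two predictions concrete (which frames, and how many), the following is a minimal sketch built only on the weak proxy signal the abstract mentions, cross-modal similarity. It is not FrameOracle's learned module: the function names `cosine` and `select_frames`, the softmax temperature, and the cumulative-mass stopping rule are all illustrative assumptions, and the embeddings are assumed to come from some pre-existing image/text encoder.

```python
import math
from typing import List, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def select_frames(
    frame_embs: Sequence[Sequence[float]],
    query_emb: Sequence[float],
    mass: float = 0.9,            # hypothetical stopping threshold
    temperature: float = 0.1,     # hypothetical softmax temperature
    max_frames: Optional[int] = None,
) -> List[int]:
    """Pick frame indices by query similarity with an adaptive budget.

    Assumption (not from the paper): frames are ranked by softmax-normalised
    cosine similarity to the query, and selection stops once the chosen
    frames cover `mass` of the total probability or `max_frames` is reached.
    """
    if not frame_embs:
        return []

    sims = [cosine(f, query_emb) for f in frame_embs]

    # Numerically stable softmax over the similarity scores.
    peak = max(sims)
    exps = [math.exp((s - peak) / temperature) for s in sims]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Most relevant frames first; ties keep their original order.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    chosen: List[int] = []
    covered = 0.0
    for i in order:
        chosen.append(i)
        covered += probs[i]
        if covered >= mass:
            break
        if max_frames is not None and len(chosen) >= max_frames:
            break

    # Return indices in temporal order so the VLM sees frames chronologically.
    return sorted(chosen)
```

Under this rule the number of selected frames varies with the input: when a few frames dominate the similarity scores, only those are kept, whereas a video whose frames are all equally relevant to the query keeps many more. A trained predictor, as described in the abstract, would replace both the similarity scores and the stopping rule.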