Recent progress in Large Multi-modal Models (LMMs) has enabled effective vision-language reasoning, yet their video understanding ability remains constrained by suboptimal frame selection strategies, despite the rapid development of video-specialized LMMs. Prior works attempted to address this with static heuristics or external retrieval modules that supply frame-level information, but these approaches often fail to capture visual cues grounded in the given user query, conflating raw visual dynamics with true semantic relevance. In this paper, we introduce ReFoCUS (Reinforcement-guided Frame Optimization for Contextual UnderStanding), the first framework to integrate online policy-gradient reinforcement learning into frame-level optimization for video-LLMs. ReFoCUS learns a frame selection policy from reward signals derived from reference models, capturing their underlying scoring behavior over the frame combinations that best support temporally grounded responses. To explore the large combinatorial frame space efficiently, we employ an autoregressive, query-conditional selection architecture that ensures contextual consistency while reducing complexity. Our policy learning removes the need for explicit frame-level supervision, as it implicitly discovers optimal and semantically consistent frame compositions. ReFoCUS consistently improves reasoning accuracy across multiple video QA benchmarks, demonstrating the advantage of aligning frame selection with model-internal utility.
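To make the selection scheme in the abstract concrete, the following is a minimal sketch of autoregressive, query-conditioned frame sampling trained with a policy-gradient update. It is not the authors' implementation. Everything in it is an assumption made for illustration: the bilinear scorer `frame_logits`, the parameter matrix `W`, the helper names, sampling without replacement, and the use of plain REINFORCE with a moving-average baseline. In particular, `toy_reward` is a hypothetical stand-in for the reward that the abstract says is derived from reference models, and the autoregression is reduced to masking frames that were already picked.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax; entries equal to -inf get probability 0."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def frame_logits(frame_feats, query, W):
    """Query-conditioned score per frame: logit_t = f_t^T W q (hypothetical scorer)."""
    return frame_feats @ W @ query


def select_frames(logits, k, rng):
    """Sample k distinct frame indices one at a time (requires k <= len(logits)).

    Returns the chosen indices and the gradient of the log-probability of
    the whole selection with respect to `logits`.
    """
    logits = np.asarray(logits, dtype=float)
    available = np.ones(logits.shape[0], dtype=bool)
    grad = np.zeros_like(logits)
    chosen = []
    for _ in range(k):
        # Frames picked earlier are masked out, so each step is
        # conditioned on the previous selections.
        probs = softmax(np.where(available, logits, -np.inf))
        idx = int(rng.choice(logits.shape[0], p=probs))
        # d log p(idx) / d logits = onehot(idx) - probs
        grad -= probs
        grad[idx] += 1.0
        chosen.append(idx)
        available[idx] = False
    return chosen, grad


def toy_reward(chosen, relevant):
    """Assumed stand-in for a reference-model reward: fraction of relevant picks."""
    return sum(i in relevant for i in chosen) / len(chosen)


def train(frame_feats, query, relevant, k=2, steps=300, lr=0.5, seed=0):
    """REINFORCE with a moving-average baseline (a generic estimator, assumed here)."""
    rng = np.random.default_rng(seed)
    W = np.zeros((frame_feats.shape[1], query.shape[0]))
    baseline = 0.0
    for _ in range(steps):
        logits = frame_logits(frame_feats, query, W)
        chosen, grad_logits = select_frames(logits, k, rng)
        reward = toy_reward(chosen, relevant)
        advantage = reward - baseline
        # Chain rule: d logit_t / dW = outer(f_t, q).
        W += lr * advantage * np.outer(frame_feats.T @ grad_logits, query)
        baseline = 0.9 * baseline + 0.1 * reward
    return W


if __name__ == "__main__":
    feats = np.eye(6)                 # six toy frames with one-hot features
    query = np.ones(6) / np.sqrt(6)   # toy query embedding
    W = train(feats, query, relevant={1, 4})
    print(softmax(frame_logits(feats, query, W)))
```

The point of the sketch is that no frame-level labels appear anywhere: the only learning signal is a scalar reward for the sampled frame set, which matches the abstract's claim that explicit frame-level supervision is not needed. Sampling frames one at a time also avoids enumerating every k-frame combination, which is one plausible reading of how an autoregressive design reduces complexity.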