Find, Fix, Reason: Context Repair for Video Reasoning

Reinforcement learning has advanced video reasoning in large multi-modal models, yet dominant pipelines rely either on on-policy self-exploration, which plateaus at the model's knowledge boundary, or on hybrid replay, which mixes policies and demands careful regularization. Dynamic context methods zoom into focused evidence but often require curated pretraining and two-stage tuning, and the context they produce remains bounded by the capability of a small model. In contrast, larger models excel at instruction following and multi-modal understanding, can supply richer context to smaller models, and can rapidly zoom in on target regions via simple tools. Building on this capability, we introduce an observation-level intervention: a frozen, tool-integrated teacher identifies the missing spatiotemporal dependency and provides a minimal evidence patch (e.g., timestamps or regions) from the original video while the question remains unchanged. The student answers again with the added context, and training updates the student through a chosen-rollout scheme integrated into Group Relative Policy Optimization (GRPO). We further propose a Robust Improvement Reward (RIR) that aligns optimization with two goals: outcome validity, through correct answers, and dependency alignment, through rationales that reflect the cited evidence. Advantages are group-normalized across the batch, preserving on-policy exploration while directing it along causally meaningful directions with minimal changes to the training stack. Experiments on various related benchmarks show consistent accuracy gains and strong generalization. The project page and source code will be available at https://github.com/JethroJames/FFR.git.
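
To make the training signal described above more concrete, the following is a minimal sketch of how a chosen-rollout step and group-normalized advantages could fit together. It is an illustration under stated assumptions, not the authors' implementation: the names `Rollout`, `toy_reward`, and `choose_rollouts`, the reward weights, and the rule "keep the repaired answer only if it scores strictly higher" are all hypothetical, and `toy_reward` is a simple stand-in rather than the actual definition of RIR.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List


@dataclass
class Rollout:
    answer_correct: bool    # outcome validity: is the final answer right?
    cites_evidence: bool    # dependency alignment: does the rationale reflect the evidence?
    repaired: bool = False  # True if produced after the teacher's evidence patch


def toy_reward(r: Rollout, w_outcome: float = 1.0, w_align: float = 0.5) -> float:
    """Hypothetical stand-in for RIR: weighted sum of the two goals."""
    return w_outcome * float(r.answer_correct) + w_align * float(r.cites_evidence)


def choose_rollouts(original: List[Rollout], repaired: List[Rollout]) -> List[Rollout]:
    """Assumed selection rule: keep the repaired rollout only if it scores strictly higher."""
    return [
        after if toy_reward(after) > toy_reward(before) else before
        for before, after in zip(original, repaired)
    ]


def group_normalized_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: center and scale each reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group in which every first attempt fails, then the student answers again
# with the teacher's evidence patch added to its context.
original = [Rollout(False, False) for _ in range(4)]
repaired = [
    Rollout(True, True, repaired=True),
    Rollout(False, False, repaired=True),
    Rollout(True, False, repaired=True),
    Rollout(False, True, repaired=True),
]

group = choose_rollouts(original, repaired)
rewards = [toy_reward(r) for r in group]
advantages = group_normalized_advantages(rewards)
print(rewards)
print(advantages)
```

In this toy group, the all-failing first attempts would yield identical rewards and therefore zero advantages; after the repaired rollouts are substituted, the rewards differ and the normalized advantages carry a usable gradient signal, which is the intuition behind intervening at the observation level.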
