VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization

The recent "Reasoning with Video" paradigm uses Video Generation Models (VGMs) to generate temporally coherent visual trajectories that complete reasoning tasks. Although state-of-the-art VGMs excel at visual quality, they often struggle to understand and follow task-specific rules, leading to logical failures across diverse reasoning scenarios. Existing efforts use Vision-Language Models (VLMs) as problem pre-solvers that produce or refine textual guidance for the VGM. However, textual descriptions fail to capture intricate spatiotemporal details, and VGMs often cannot faithfully execute fine-grained or long-tail instructions even when given a valid plan. While VLMs struggle as solvers, they have strong perception capabilities for evaluating whether process constraints are satisfied and whether the final goal is achieved. Leveraging this strength, we propose a paradigm shift that recasts VLMs as "teachers". Specifically, a VLM teacher extracts task-specific rules and formulates them as differentiable rewards, which guide a VGM Reasoner through test-time online optimization of a lightweight LoRA module. This strategy enables adaptive test-time optimization and extends reasoning capabilities beyond the VGM's intrinsic boundaries. Evaluations on symbolic (VBVR-Bench) and general-purpose (RULER-Bench) video reasoning benchmarks show that the proposed method yields a 16.7-point average performance gain, outperforming the VLM-as-Solver paradigm (+0.4 points) and Best-of-N scaling (+2.2 points) by a large margin at comparable test-time cost. These findings indicate that integrating VLMs as test-time teachers is a promising paradigm for generalizable video reasoning. Project Page: https://VLM-as-Teacher.github.io/
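
To make the core idea concrete, the following is a minimal toy sketch of test-time optimization of a low-rank adapter against a differentiable reward. It is not the paper's implementation. Everything in it is an assumption made for illustration: the frozen matrix `W` stands in for the VGM, the rank-1 factors `a` and `b` stand in for the LoRA module, and the quadratic `reward` with its `target` vector stands in for a rule-based reward derived by the VLM teacher. Only the adapter is updated; the base weights stay frozen.

```python
# Toy illustration (hypothetical, not the authors' code): gradient ascent on a
# rank-1 adapter so that a frozen linear "model" better satisfies a reward.

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def forward(W, a, b, x):
    """Output of the frozen base W plus the rank-1 update a b^T, applied to x."""
    s = sum(bj * xj for bj, xj in zip(b, x))          # scalar b . x
    return [yi + ai * s for yi, ai in zip(matvec(W, x), a)]

def reward(y, target):
    """Assumed differentiable reward: negative squared distance to a target."""
    return -sum((yi - ti) ** 2 for yi, ti in zip(y, target))

def adapt(W, x, target, steps=200, lr=0.05):
    """Optimize only the adapter factors (a, b) at test time; W is never modified."""
    d = len(x)
    a = [0.0] * d        # output-side factor starts at zero, so the adapter is a no-op at first
    b = [0.5] * d        # input-side factor starts nonzero so gradients can flow to a
    for _ in range(steps):
        y = forward(W, a, b, x)
        g = [-2.0 * (yi - ti) for yi, ti in zip(y, target)]   # dR/dy
        s = sum(bj * xj for bj, xj in zip(b, x))
        grad_a = [gi * s for gi in g]                          # dR/da
        c = sum(gi * ai for gi, ai in zip(g, a))
        grad_b = [c * xj for xj in x]                          # dR/db
        a = [ai + lr * ga for ai, ga in zip(a, grad_a)]        # gradient ascent on the reward
        b = [bj + lr * gb for bj, gb in zip(b, grad_b)]
    return a, b

W = [[1.0, 0.0], [0.0, 1.0]]      # frozen base weights
x = [1.0, 2.0]                    # a single test-time input
target = [2.0, 1.0]               # stand-in for what the teacher's rules require

reward_before = reward(matvec(W, x), target)
a, b = adapt(W, x, target)
reward_after = reward(forward(W, a, b, x), target)
print(reward_before, reward_after)
```

In this sketch the reward rises toward zero while `W` is left untouched, which mirrors the abstract's claim in miniature: a small trainable module, steered by a differentiable signal, changes the output of a frozen generator for one specific task instance.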
