Reinforcement learning (RL) has shown strong potential for enhancing reasoning in multimodal large language models, yet existing video reasoning methods often rely on coarse sequence-level rewards or single-factor token selection. They neglect the fine-grained links among visual inputs, temporal dynamics, and linguistic outputs, which limits both accuracy and interpretability. We propose Video-KTR, a modality-aware policy shaping framework that performs selective, token-level RL by combining three attribution signals: (1) visual-aware tokens identified via counterfactual masking to reveal perceptual dependence; (2) temporal-aware tokens detected through frame shuffling to expose temporal sensitivity; and (3) high-entropy tokens signaling predictive uncertainty. By reinforcing only these key tokens, Video-KTR focuses learning on semantically informative, modality-sensitive content while filtering out low-value tokens. Across five challenging benchmarks, Video-KTR achieves state-of-the-art or highly competitive results, reaching 42.7\% on Video-Holmes (surpassing GPT-4o) with consistent gains on both reasoning and general video understanding tasks. Ablation studies verify the complementary roles of the attribution signals and the robustness of targeted token-level updates. Overall, Video-KTR improves accuracy and interpretability, offering a simple, drop-in extension to RL for complex video reasoning. Our code and models are available at https://github.com/zywang0104/Video-KTR.
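The token-selection idea above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: all function names, thresholds (`tau_*`), and the use of absolute log-probability change as the sensitivity measure are assumptions made for clarity. A token is kept if it is high-entropy, visually sensitive (its log-probability shifts when the visual input is counterfactually masked), or temporally sensitive (it shifts when frames are shuffled), and the policy-gradient update is restricted to the resulting mask.

```python
import numpy as np

# Hypothetical sketch of selective, token-level policy shaping.
# Names and thresholds are illustrative assumptions, not the paper's code.

def token_entropy(probs):
    """Shannon entropy of the next-token distribution at each position.

    probs: (T, V) array of per-position token probabilities.
    """
    return -np.sum(probs * np.log(probs + 1e-12), axis=-1)

def perturbation_sensitivity(logp_orig, logp_perturbed):
    """Per-token change in log-prob of the chosen token under a perturbation
    (counterfactual visual masking or frame shuffling)."""
    return np.abs(logp_orig - logp_perturbed)

def key_token_mask(probs, logp, logp_vis_masked, logp_frame_shuffled,
                   tau_ent=0.5, tau_vis=1.0, tau_temp=1.0):
    """Union of the three attribution signals: a token is 'key' if it is
    high-entropy, visually sensitive, or temporally sensitive."""
    high_entropy = token_entropy(probs) > tau_ent
    visual_aware = perturbation_sensitivity(logp, logp_vis_masked) > tau_vis
    temporal_aware = perturbation_sensitivity(logp, logp_frame_shuffled) > tau_temp
    return high_entropy | visual_aware | temporal_aware

def masked_policy_gradient_loss(logp, advantages, mask):
    """REINFORCE-style loss applied only to the selected key tokens."""
    mask = mask.astype(logp.dtype)
    return -np.sum(mask * advantages * logp) / max(mask.sum(), 1.0)
```

Because low-value tokens receive zero gradient weight, the update concentrates learning signal on modality-sensitive content, matching the filtering behavior described above.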