Post-training with Reinforcement Learning (RL) has substantially improved reasoning in Large Language Models (LLMs) via test-time scaling. However, extending this paradigm to Multimodal LLMs (MLLMs) through verbose rationales yields limited gains for perception and can even degrade performance. We propose Reinforced Attention Learning (RAL), a policy-gradient framework that directly optimizes internal attention distributions rather than output token sequences. By shifting optimization from what to generate to where to attend, RAL promotes effective information allocation and improved grounding in complex multimodal inputs. Experiments across diverse image and video benchmarks show consistent gains over GRPO and other baselines. We further introduce On-Policy Attention Distillation, demonstrating that transferring latent attention behaviors yields stronger cross-modal alignment than standard knowledge distillation. Our results position attention policies as a principled and general alternative for multimodal post-training.
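The abstract states only that RAL applies policy gradients to internal attention distributions rather than to output token sequences. As a rough intuition, and not the paper's actual implementation, the sketch below treats a per-token attention distribution as a stochastic policy and updates it with a REINFORCE-style objective; the AttentionPolicy module, the reward function, and all hyperparameters are illustrative assumptions.

```python
# Minimal sketch of a policy gradient over attention (NOT the paper's method).
# A per-token attention distribution is treated as a stochastic policy and
# updated with REINFORCE; names and the toy reward are assumptions.
import torch
import torch.nn as nn


class AttentionPolicy(nn.Module):
    """Scores input tokens and samples one position to attend to."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # per-token attention logit

    def forward(self, tokens: torch.Tensor):
        # tokens: (batch, num_tokens, dim)
        logits = self.score(tokens).squeeze(-1)           # (batch, num_tokens)
        dist = torch.distributions.Categorical(logits=logits)
        idx = dist.sample()                               # sampled "where to attend"
        return idx, dist.log_prob(idx)


def ral_step(policy, optimizer, tokens, reward_fn):
    """One REINFORCE update: reinforce attention choices that earn reward."""
    idx, log_prob = policy(tokens)
    reward = reward_fn(idx)                               # e.g., answer correctness
    advantage = reward - reward.mean()                    # simple mean baseline
    loss = -(advantage * log_prob).mean()                 # policy-gradient loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Toy run: the reward favors attending to the final token.
torch.manual_seed(0)
policy = AttentionPolicy(dim=16)
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)
tokens = torch.randn(8, 10, 16)
for _ in range(200):
    ral_step(policy, opt, tokens, reward_fn=lambda i: (i == 9).float())
```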
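Likewise, the abstract describes On-Policy Attention Distillation only as transferring latent attention behaviors from a teacher. One plausible reading, sketched below under that assumption, is a KL term that pulls the student's attention distribution toward the teacher's on the student's own (on-policy) inputs; the function name and tensor shapes are hypothetical.

```python
# Hypothetical sketch of attention distillation (not the paper's recipe):
# align the student's attention distribution with the teacher's via KL.
import torch
import torch.nn.functional as F


def attention_distill_loss(student_attn_logits: torch.Tensor,
                           teacher_attn_logits: torch.Tensor) -> torch.Tensor:
    """KL(teacher || student) over attention positions, averaged per batch."""
    log_s = F.log_softmax(student_attn_logits, dim=-1)
    t = F.softmax(teacher_attn_logits, dim=-1)
    # F.kl_div expects log-probabilities as input and probabilities as target.
    return F.kl_div(log_s, t, reduction="batchmean")


# Toy usage with random attention logits over 10 positions.
s = torch.randn(4, 10, requires_grad=True)
t = torch.randn(4, 10)
attention_distill_loss(s, t).backward()
```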