Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimization remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose PAPO (Perception-Aware Policy Optimization), a novel policy gradient algorithm that encourages the model to learn to perceive while learning to reason. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term, which can be seamlessly plugged into mainstream RLVR algorithms such as GRPO and DAPO. Notably, PAPO does not rely on additional data curation, reward models, or stronger teacher models. To further enhance the training stability of PAPO, we introduce the Double Entropy Loss, which effectively regularizes the new KL objective without compromising performance. Despite its simplicity, PAPO yields significant overall improvements of 4.4%-17.5% on diverse multimodal benchmarks. The improvements are more pronounced, reaching 8.0%-19.1%, on tasks with high vision dependency. We also observe a substantial reduction of 30.5% in perception errors, indicating improved perceptual capabilities with PAPO. Overall, our work introduces a deeper integration of perception-aware supervision into core learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning. Code and data will be made publicly available for research purposes. Project page: https://mikewangwzhl.github.io/PAPO.
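As an illustrative sketch of the kind of objective described above (our notation, not the paper's official formulation): assuming the Implicit Perception Loss compares the policy's output distribution conditioned on the intact image $I$ against the same policy conditioned on a visually degraded input $I_{\mathrm{mask}}$, the augmented objective could take the form

\[
\mathcal{J}_{\mathrm{PAPO}}(\theta) \;=\; \mathcal{J}_{\mathrm{GRPO}}(\theta) \;+\; \gamma \, \mathbb{E}_{o \sim \pi_\theta(\cdot \mid q, I)} \Big[ D_{\mathrm{KL}}\big( \pi_\theta(o \mid q, I) \,\big\|\, \pi_\theta(o \mid q, I_{\mathrm{mask}}) \big) \Big],
\]

where $q$ is the textual query, $o$ a sampled rollout, $\mathcal{J}_{\mathrm{GRPO}}$ the base RLVR objective (e.g., GRPO or DAPO), and $\gamma$ and $I_{\mathrm{mask}}$ are illustrative placeholders for the loss weight and the degraded image. Maximizing such a KL term rewards completions whose likelihood genuinely depends on the visual content, which is consistent with the perception-aware supervision the abstract describes.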