Reinforcement learning with verifiable rewards (RLVR) has substantially enhanced the reasoning capabilities of multimodal large language models (MLLMs). However, existing RLVR approaches typically rely on outcome-driven optimization that updates both perception and reasoning using a shared reward based solely on the final answer. This shared reward blurs credit assignment: it frequently improves reasoning patterns while failing to reliably improve the accuracy of upstream visual evidence extraction. To address this perception bottleneck, we introduce PRCO (Perception-Reasoning Coevolution), a dual-role RLVR framework with a shared policy. PRCO consists of two cooperative roles: an Observer that generates an evidence caption tailored to the question, and a Solver that predicts the final answer based on this caption. Crucially, PRCO employs role-specific reward signals: the Solver is optimized using verifiable outcome rewards on the final answer, while the Observer receives a utility reward derived from the Solver's downstream success. Extensive experiments on eight challenging multimodal reasoning benchmarks show that PRCO consistently improves over the base model across model scales, by more than 7 points in average accuracy, and outperforms prior open-source RL-tuned baselines.
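As an illustration of the role-specific rewards described above, the following is a minimal sketch and not the authors' implementation. It assumes that the Solver's verifiable reward is an exact-match check against a reference answer, and that the Observer's utility reward for a caption is the mean Solver reward over several Solver answers conditioned on that caption. The abstract does not specify either formula; the function names `solver_reward` and `observer_reward` are hypothetical.

```python
from typing import List


def solver_reward(predicted: str, gold: str) -> float:
    """Verifiable outcome reward for the Solver.

    Assumption: exact match after trimming whitespace and lowercasing.
    """
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0


def observer_reward(solver_answers: List[str], gold: str) -> float:
    """Utility reward for the Observer's caption.

    Assumption: the mean Solver reward over the answers that the Solver
    produced when conditioned on this caption. An empty list yields 0.0.
    """
    if not solver_answers:
        return 0.0
    total = sum(solver_reward(answer, gold) for answer in solver_answers)
    return total / len(solver_answers)


if __name__ == "__main__":
    # Four hypothetical Solver answers conditioned on a single caption.
    answers = ["A", "b", "a", "a"]
    print(solver_reward(answers[0], "a"))
    print(observer_reward(answers, "a"))
```

Under these assumptions, a caption is rewarded only to the extent that it helps the Solver reach the correct answer, so the perception signal is kept separate from the Solver's own outcome reward even though both roles share one policy.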