Reinforcement learning with verifiable rewards (RLVR) has substantially enhanced the reasoning capabilities of multimodal large language models (MLLMs). However, existing RLVR approaches typically rely on outcome-driven optimization that updates both perception and reasoning using a shared reward based solely on the final answer. This shared reward blurs credit assignment: it frequently improves reasoning patterns but fails to reliably improve the accuracy of upstream visual evidence extraction. To address this perception bottleneck, we introduce PRCO (Perception-Reasoning Coevolution), a dual-role RLVR framework with a shared policy. PRCO consists of two cooperative roles: an Observer that generates an evidence caption tailored to the question, and a Solver that predicts the final answer based on this caption. Crucially, PRCO employs role-specific reward signals: the Solver is optimized using verifiable outcome rewards on the final answer, while the Observer receives a utility reward derived from the Solver's downstream success. Extensive experiments on eight challenging multimodal reasoning benchmarks show that PRCO consistently improves average accuracy by more than 7 points over the base model across model scales, and that it outperforms prior open-source RL-tuned baselines.
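The role-specific reward split described above can be illustrated with a minimal sketch. The abstract does not give the exact form of the Observer's utility reward, so this example assumes it is the Solver's success rate over several rollouts conditioned on one caption. All names here (`SolverRollout`, `normalize`, `solver_reward`, `observer_utility`) are hypothetical and are not taken from the PRCO implementation; exact-match checking after normalization is likewise an assumption.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class SolverRollout:
    """One Solver attempt conditioned on a single Observer caption (hypothetical type)."""
    predicted_answer: str


def normalize(answer: str) -> str:
    # Assumed normalization: trim whitespace and ignore case.
    return answer.strip().lower()


def solver_reward(predicted_answer: str, ground_truth: str) -> float:
    """Verifiable outcome reward: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if normalize(predicted_answer) == normalize(ground_truth) else 0.0


def observer_utility(rollouts: list[SolverRollout], ground_truth: str) -> float:
    """Assumed utility reward: the Solver's success rate given this caption."""
    if not rollouts:
        return 0.0
    return mean(solver_reward(r.predicted_answer, ground_truth) for r in rollouts)


# Usage: four Solver rollouts conditioned on the same caption.
rollouts = [SolverRollout("b"), SolverRollout("a"), SolverRollout("B"), SolverRollout("b")]
caption_utility = observer_utility(rollouts, "b")
```

Under these assumptions, each Solver rollout is credited only for its own final answer, while the caption is credited for how often it leads the Solver to a correct answer. This gives the perception step a signal separate from the shared outcome reward.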