Multimodal large language models (MLLMs) have rapidly advanced from perception tasks to complex multi-step reasoning, yet reinforcement learning with verifiable rewards (RLVR) often induces spurious reasoning because it rewards only final-answer correctness. To address this limitation, we propose AutoRubric-R1V, a framework that integrates RLVR with process-level supervision through automatically collected rubric-based generative rewards. Our key innovation is a scalable self-aggregation method that distills consistent reasoning checkpoints from successful trajectories, enabling problem-specific rubric construction without human annotation or stronger teacher models. By jointly leveraging rubric-based and outcome rewards, AutoRubric-R1V achieves state-of-the-art performance on six multimodal reasoning benchmarks and substantially improves reasoning faithfulness in dedicated evaluations.
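As a minimal illustrative sketch of the joint objective described above (the notation and the mixing weight $\lambda$ are our assumptions, not taken from the paper), the per-trajectory reward could combine the two signals as:

\[
R(\tau) \;=\; R_{\text{outcome}}(\tau) \;+\; \lambda \, R_{\text{rubric}}(\tau),
\]

where $R_{\text{outcome}}(\tau)$ is the verifiable final-answer reward, $R_{\text{rubric}}(\tau)$ is the generative reward scored against the automatically constructed problem-specific rubric, and $\lambda$ is a hypothetical weighting coefficient balancing outcome and process supervision.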