Reinforcement learning with verifiable rewards has driven major gains in LLM reasoning, and it is natural to assume that the recipe will transfer to multimodal models. However, multimodal models must do two things: first perceive what is in an image, then reason about what it implies. Because these two stages are graded jointly, it is hard to tell how much room reasoning alone has to grow. We study this question on algorithmic visual puzzles, where both components are necessary, and show that perception, not reasoning, is the binding constraint: replacing images with simple textual descriptions raises performance by over 20 points on average for Claude models. We then evaluate six reward designs aimed at inducing visual grounding during reasoning without chain-of-thought supervision. When Qwen-2.5-VL-7B is trained with GRPO, reward design induces long, structured reasoning with self-reflection and visual references, yielding a 5.56-point gain over the base model. These gains are uneven, however: no single reward improves all categories, and rewards with verifiable accuracy signals trade out-of-domain transfer for in-domain accuracy. These results point to perception-aware reward design as a path forward, in which reward signals correct perception at its source rather than the reasoning that inherits its errors.