Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization

Multimodal reasoning models (MRMs) trained with reinforcement learning with verifiable rewards (RLVR) show improved accuracy on visual reasoning benchmarks. However, we observe that these accuracy gains often come at the cost of reasoning quality: generated Chain-of-Thought (CoT) traces are frequently inconsistent with the final answer and poorly grounded in the visual evidence. We systematically study this phenomenon across seven challenging real-world spatial reasoning benchmarks and find that it affects contemporary MRMs such as ViGoRL-Spatial and TreeVGR, as well as our own models trained with standard Group Relative Policy Optimization (GRPO). We characterize CoT reasoning quality along two complementary axes: "logical consistency" (does the CoT entail the final answer?) and "visual grounding" (does each reasoning step accurately describe objects, attributes, and spatial relationships in the image?). To address these failures, we propose Faithful GRPO (FGRPO), a variant of GRPO that enforces consistency and grounding as constraints via Lagrangian dual ascent. FGRPO incorporates batch-level consistency and grounding constraints into the advantage computation within a group, adaptively adjusting the relative importance of the constraints during optimization. We evaluate FGRPO on Qwen2.5-VL-7B and 3B backbones across seven spatial datasets. Our results show that FGRPO substantially improves reasoning quality, reducing the inconsistency rate from 24.5% to 1.7% and improving visual grounding scores by 13%. It also improves final answer accuracy over standard GRPO, demonstrating that faithful reasoning enables better answers.

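The abstract describes folding consistency and grounding constraints into the group-level advantage and adapting their weights by Lagrangian dual ascent, but it does not give the formulas. The following is a minimal sketch of one generic way such a scheme could look, not the authors' implementation. The function names, the additive form of the shaped reward, the thresholds, and the dual learning rate are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_advantages(task_rewards, constraint_scores, multipliers, eps=1e-6):
    """Group-normalized advantages for the responses sampled for one prompt.

    task_rewards:      list of floats, one verifiable reward per response.
    constraint_scores: dict mapping a constraint name to a list of per-response
                       scores (higher means the constraint is better satisfied).
    multipliers:       dict mapping a constraint name to its Lagrange multiplier.

    Assumption: constraints enter as an additive, multiplier-weighted bonus.
    """
    shaped = []
    for i, reward in enumerate(task_rewards):
        bonus = sum(multipliers[k] * constraint_scores[k][i] for k in multipliers)
        shaped.append(reward + bonus)
    mu = mean(shaped)
    sigma = pstdev(shaped)
    # Standard GRPO-style normalization within the group.
    return [(s - mu) / (sigma + eps) for s in shaped]


def dual_ascent_step(multipliers, batch_means, thresholds, lr=0.1):
    """One projected dual-ascent update of the multipliers.

    A multiplier grows when the batch-level mean score falls below its target
    threshold (constraint violated) and shrinks toward zero otherwise.
    The thresholds and lr are hypothetical hyperparameters.
    """
    return {
        k: max(0.0, multipliers[k] + lr * (thresholds[k] - batch_means[k]))
        for k in multipliers
    }
```

With all multipliers at zero this reduces to plain group-normalized advantages, so two equally accurate responses are treated identically. Once a multiplier is positive, the response whose reasoning is consistent and better grounded receives the higher advantage, which is the behavior the abstract attributes to FGRPO.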