Large Vision-Language Models (LVLMs) have become powerful general-purpose assistants, yet their predictions often lack reliability and interpretability due to insufficient grounding in visual evidence. The emerging thinking-with-images paradigm seeks to address this issue by explicitly anchoring reasoning to image regions. However, we empirically find that most existing methods suffer from a systematic scale-driven bias in optimization, where training rewards are dominated by large visual regions, suppressing learning from small but semantically critical evidence and leading to spurious grounding at inference time. To address this limitation, we propose Ground-R1, a de-biased thinking-with-images framework trained via a novel Scale Relative Policy Optimization (SRPO) objective that replaces standard GRPO. Specifically, our SRPO recalibrates reward learning across evidence regions of different sizes through scale-aware binning and intra-/inter-bin comparisons, enabling balanced credit assignment during training. Experimental results on general LVLM, high-resolution, and visual grounding benchmarks validate the effectiveness of Ground-R1 and show that SRPO yields consistent gains over standard GRPO in both response accuracy and evidence grounding.
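To make the scale-aware recalibration concrete, the following is a minimal sketch of how a scale-relative advantage could be computed in place of GRPO's group-relative normalization. It is not the paper's implementation: the function name `srpo_advantages`, the bin edges, the mixing weight `lam`, and the specific way intra-bin and inter-bin comparisons are combined are all illustrative assumptions, since the abstract only states that rollouts are binned by evidence-region scale and compared within and across bins.

```python
import numpy as np

def srpo_advantages(rewards, region_areas, bin_edges=(0.05, 0.25, 1.0), lam=0.5, eps=1e-6):
    """Hypothetical sketch of a scale-relative advantage computation.

    rewards      : per-rollout scalar rewards within one GRPO-style group
    region_areas : normalized area (fraction of the image) of the evidence
                   region each rollout grounded its answer in
    bin_edges    : upper edges of the scale-aware bins (assumed values)
    lam          : weight mixing intra-bin and inter-bin terms (assumed)
    """
    rewards = np.asarray(rewards, dtype=float)
    areas = np.asarray(region_areas, dtype=float)

    # Scale-aware binning: assign each rollout to a bin by evidence-region area.
    bin_ids = np.digitize(areas, bin_edges, right=True)

    adv = np.zeros_like(rewards)
    bin_means = {}
    for b in np.unique(bin_ids):
        mask = bin_ids == b
        mu, sigma = rewards[mask].mean(), rewards[mask].std()
        # Intra-bin comparison: standardize rewards against same-scale peers only,
        # so small-region rollouts are not drowned out by large-region ones.
        adv[mask] = (rewards[mask] - mu) / (sigma + eps)
        bin_means[b] = mu

    # Inter-bin comparison: standardize each bin's mean reward across bins and
    # add it back, so the policy still receives a signal about which scales help.
    means = np.array([bin_means[b] for b in sorted(bin_means)])
    inter = (means - means.mean()) / (means.std() + eps)
    for i, b in enumerate(sorted(bin_means)):
        adv[bin_ids == b] += lam * inter[i]

    return adv
```

The design intent illustrated here is that, unlike plain GRPO (which standardizes every rollout against the whole group and therefore lets large-region rollouts dominate the statistics), each rollout is first credited relative to peers of comparable evidence scale, with a weaker cross-scale term preserving a global learning signal.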