Vision-language models (VLMs) achieve strong single-shot spatial grounding, yet lack any mechanism to observe and correct their own predictions. We find that naively prompting a VLM to iterate over rendered visualizations of its predictions causes catastrophic failure: [email protected] on referring expression comprehension collapses from 79.6% to 48.7% (a drop of roughly 31 percentage points), revealing a fundamental gap between grounding capability and self-correction ability. We propose Iterative Visual Thinking (IVT), a closed-loop framework in which the model predicts a bounding box, observes the prediction rendered on the image, and iteratively refines it through visual feedback. A two-phase training recipe closes the self-correction gap. First, we use the base model's own predictions as realistic errors and prompt a teacher VLM to generate corrective reasoning traces, yielding supervised data without human annotation. Second, we apply Group Relative Policy Optimization (GRPO) with a simple IoU reward to stabilize multi-step refinement. On a mixed benchmark spanning RefCOCOg, Ref-Adv, and Ref-L4 (505 test samples), a supervised fine-tuning (SFT) warm-up with IVT surpasses the single-shot base model on every metric: [email protected] rises to 82.0% (+2.4pp), [email protected] to 74.1% (+3.2pp), and [email protected] to 48.3% (+2.8pp). GRPO further reduces per-step IoU degradation by 5x, stabilizing the refinement trajectory. All training uses only 2,400 samples on a single GPU, demonstrating that spatial self-correction is a learnable capability that can be instilled at modest scale.
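To make the closed loop and the IoU reward concrete, the following is a minimal sketch, not the authors' implementation. It assumes boxes in corner format `(x1, y1, x2, y2)`; the names `iou`, `refine`, and `group_relative_advantages` are hypothetical, and the `predict` callable stands in for the VLM together with the step that renders the previous box on the image. The advantage function shows the standard group-relative normalization used in GRPO (reward minus group mean, divided by group standard deviation); the reward shaping actually used in training may differ.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def refine(predict: Callable[[Optional[Box]], Box], steps: int) -> List[Box]:
    """Closed-loop refinement: each step receives the previous prediction.

    `predict` is a hypothetical stand-in for the model; passing the previous
    box represents showing the model its own prediction drawn on the image.
    """
    trajectory: List[Box] = []
    prev: Optional[Box] = None
    for _ in range(steps):
        prev = predict(prev)
        trajectory.append(prev)
    return trajectory


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one sampled group (GRPO-style advantages)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]
```

In this sketch, the per-step IoU against the ground-truth box serves both as the reward for each sampled trajectory and as a way to monitor whether refinement improves or degrades the prediction from one step to the next.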