Iterative Visual Thinking: Teaching Vision-Language Models Spatial Self-Correction through Visual Feedback

Vision-language models (VLMs) achieve strong single-shot spatial grounding, yet lack any mechanism to observe and correct their own predictions. We find that naively prompting a VLM to iterate over rendered visualizations of its predictions causes catastrophic failure: [email protected] on referring expression comprehension collapses from 79.6% to 48.7% (a drop of roughly 31 percentage points), revealing a fundamental gap between grounding capability and self-correction ability. We propose Iterative Visual Thinking (IVT), a closed-loop framework in which the model predicts a bounding box, observes that prediction rendered on the image, and iteratively refines it through visual feedback. A two-phase training recipe closes the self-correction gap. First, we use the base model's own predictions as realistic errors and prompt a teacher VLM to generate corrective reasoning traces, yielding supervised fine-tuning (SFT) data without human annotation. Second, we apply Group Relative Policy Optimization (GRPO) with a simple IoU reward to stabilize multi-step refinement. On a mixed benchmark spanning RefCOCOg, Ref-Adv, and Ref-L4 (505 test samples), IVT after the SFT warm-up alone surpasses the single-shot base model on every metric: [email protected] rises to 82.0% (+2.4pp), [email protected] to 74.1% (+3.2pp), and [email protected] to 48.3% (+2.8pp). GRPO further reduces per-step IoU degradation fivefold, stabilizing the refinement trajectory. All training uses only 2,400 samples on a single GPU, demonstrating that spatial self-correction is a learnable capability that can be instilled at modest scale.
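To make the quantities in the abstract concrete, the following is a minimal sketch of the pieces it names: the IoU between a predicted and a ground-truth box (the reward signal), accuracy at an IoU threshold, a closed refinement loop, and the group-relative advantage commonly used in GRPO. It is an illustration under stated assumptions, not the authors' implementation. The `predict` callable is a hypothetical stand-in for the VLM call (rendering the previous box onto the image is assumed to happen inside it), the `>=` comparison in `accuracy_at` is one common convention, and the stopping rule and step budget are assumptions.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x1 <= x2 and y1 <= y2


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at(preds: Sequence[Box], gts: Sequence[Box], threshold: float) -> float:
    """Fraction of predictions whose IoU with the ground truth reaches the threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


def refine(predict: Callable[[Optional[Box]], Box], max_steps: int = 3) -> List[Box]:
    """Closed-loop refinement sketch.

    `predict(prev_box)` is a hypothetical model call: with None it returns the
    single-shot prediction; otherwise the model is assumed to see `prev_box`
    rendered on the image and to return a corrected box.
    """
    trajectory: List[Box] = []
    prev: Optional[Box] = None
    for _ in range(max_steps):
        box = predict(prev)
        if box == prev:  # assumed stopping rule: the model keeps its previous box
            break
        trajectory.append(box)
        prev = box
    return trajectory


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward by the mean and standard deviation of its sampled group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    target: Box = (0.0, 0.0, 2.0, 2.0)

    # Toy predictor: a poor first guess, then the correct box once feedback is available.
    def toy_predict(prev: Optional[Box]) -> Box:
        return (0.0, 0.0, 1.0, 1.0) if prev is None else target

    steps = refine(toy_predict, max_steps=5)
    rewards = [iou(box, target) for box in steps]
    print(steps, rewards, group_relative_advantages(rewards))
```

In this sketch the reward for a trajectory step is simply its IoU with the ground truth, which matches the "simple IoU reward" described above. How rewards are assigned across refinement steps in the actual training setup is not specified in the abstract.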
