Strengthening multimodal reasoning remains a central challenge for Vision-Language Models (VLMs), particularly on ambiguous or complex visual inputs, where initial inferences often produce hallucinations or logical errors. Existing VLMs frequently give plausible yet ungrounded answers, and even when prompted to "reflect", their corrections may remain detached from the image evidence. To address this, we propose MIRROR, a framework for Multimodal Iterative Reasoning via Reflection On visual Regions. MIRROR embeds visual reflection as a core mechanism and is formulated as a closed-loop process of draft, critique, region-based verification, and revision, repeated until the output is visually grounded. To train the model, we construct **ReflectV**, a visual reflection dataset for multi-turn supervision that explicitly contains reflection triggers, region-based verification actions, and answer revisions grounded in visual evidence. Experiments on both general vision-language benchmarks and representative vision-language reasoning benchmarks show that MIRROR improves correctness and reduces visual hallucinations, demonstrating the value of training reflection as an evidence-seeking, region-aware verification process rather than a purely textual revision step.
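The closed loop described in the abstract (draft, critique, region-based verification, revision) can be illustrated with a minimal control-flow sketch. This is not the authors' implementation: `reflect_loop`, `Critique`, `Turn`, `make_toy_model`, the bounding-box `Region` format, and the `max_rounds` cap are all hypothetical names and choices introduced here for illustration, and the toy "model" is a stub that replaces a real VLM with a dictionary lookup.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical region format: an (x0, y0, x1, y1) bounding box.
Region = Tuple[int, int, int, int]


@dataclass
class Critique:
    needs_check: bool          # reflection trigger: does the answer need evidence?
    region: Optional[Region]   # where to look if a check is needed


@dataclass
class Turn:
    answer: str                # answer that was critiqued in this round
    critique: Critique
    evidence: Optional[str]    # what region-based verification returned, if run


def reflect_loop(draft, critique, verify, revise, question, max_rounds=3):
    """Draft once, then critique / verify / revise until the critic accepts."""
    answer = draft(question)
    trace = []
    for _ in range(max_rounds):  # assumed safety cap, not taken from the abstract
        c = critique(question, answer)
        if not c.needs_check:
            trace.append(Turn(answer, c, None))
            break
        evidence = verify(c.region)                  # inspect the flagged region
        trace.append(Turn(answer, c, evidence))
        answer = revise(question, answer, evidence)  # revision uses the evidence
    return answer, trace


def make_toy_model(image):
    """Stand-in for a VLM: `image` maps region boxes to text descriptions."""
    checked = set()  # answers already backed by region evidence

    def draft(question):
        return "blue"  # deliberately ungrounded first guess

    def critique(question, answer):
        if answer in checked:
            return Critique(needs_check=False, region=None)
        return Critique(needs_check=True, region=(10, 10, 50, 50))

    def verify(region):
        return image[region]

    def revise(question, answer, evidence):
        checked.add(evidence)
        return evidence

    return draft, critique, verify, revise


if __name__ == "__main__":
    toy_image = {(10, 10, 50, 50): "red"}
    final, turns = reflect_loop(*make_toy_model(toy_image),
                                question="What color is the mug?")
    print(final, len(turns))
```

The returned `trace` keeps one record per round, which loosely mirrors the kind of multi-turn supervision the abstract attributes to ReflectV (reflection trigger, verification action, revised answer); the actual data format is not specified in the abstract.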