Large vision-language models have achieved remarkable progress in visual reasoning, yet most existing systems rely on single-step or text-only reasoning, limiting their ability to iteratively refine understanding across multiple visual contexts. To address this limitation, we introduce RegionDial-Bench, a new multi-round visual reasoning benchmark with training and test sets spanning both detection and segmentation tasks, enabling systematic evaluation under iterative reasoning scenarios. We further propose RegionReasoner, a reinforcement learning framework that enforces grounded reasoning by requiring each reasoning trace to explicitly cite the corresponding reference bounding boxes, while maintaining semantic coherence via a global-local consistency reward. This reward extracts key objects and nouns from both global scene captions and region-level captions and aligns them with the reasoning trace, ensuring consistency across reasoning steps. RegionReasoner is optimized with structured rewards combining grounding fidelity and global-local semantic alignment. Experiments on the detection and segmentation tasks of RegionDial-Bench show that RegionReasoner-7B considerably improves multi-round reasoning accuracy, spatial grounding precision, and global-local consistency, establishing a strong baseline for this emerging research direction.
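The global-local consistency reward described above can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's implementation: it approximates "key objects and nouns" with naive keyword extraction (a real system would use a noun-phrase extractor or an LLM), and it scores consistency as the fraction of caption keywords that reappear in the reasoning trace, blending the global-caption and region-caption scores with an assumed weight `alpha`.

```python
def consistency_reward(global_caption, region_captions, trace, alpha=0.5):
    """Hypothetical global-local consistency reward in [0, 1].

    Approximates key-object extraction by keeping alphabetic, non-function
    words, then measures how completely each caption's keywords are covered
    by the reasoning trace.
    """
    STOP = {"the", "a", "an", "of", "in", "on", "and", "with",
            "is", "are", "to", "at", "it", "its"}

    def keywords(text):
        # Naive proxy for noun extraction: lowercase alphabetic content words.
        toks = (t.strip(".,;:!?").lower() for t in text.split())
        return {t for t in toks if t.isalpha() and t not in STOP}

    trace_kw = keywords(trace)

    def coverage(kw):
        # Fraction of caption keywords mentioned in the reasoning trace.
        return len(kw & trace_kw) / max(len(kw), 1)

    global_score = coverage(keywords(global_caption))
    region_scores = [coverage(keywords(c)) for c in region_captions]
    local_score = sum(region_scores) / max(len(region_scores), 1)

    # Blend global and local alignment into a single scalar reward.
    return alpha * global_score + (1 - alpha) * local_score


# Example: the trace covers both global keywords but only one region keyword.
r = consistency_reward(
    global_caption="a dog on grass",
    region_captions=["brown dog"],
    trace="the dog sits on green grass",
)
```

In a reinforcement learning setup such as the one the abstract describes, this scalar would be combined with a grounding-fidelity term (e.g. IoU against the cited reference boxes) to form the structured reward.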