Current Vision-Language Models (VLMs) struggle with fine-grained spatial reasoning, particularly when multi-step logic and precise spatial alignment are required. In this work, we introduce SpatialReasoner-R1, a vision-language reasoning model designed to address these limitations. To construct high-quality supervision for spatial reasoning, we design a Multi-Model Monte Carlo Tree Search (M3CTS) method that generates diverse, logically consistent Long Chain-of-Thought (LongCOT) reasoning trajectories. In addition, we propose a fine-grained Direct Preference Optimization (fDPO) method that introduces segment-specific preference granularity for descriptive grounding and logical reasoning, guided by a spatial reward mechanism that evaluates candidate responses based on visual consistency, spatial grounding, and logical coherence. Experimental results demonstrate that fDPO achieves relative performance gains of 4.1% and 9.0% over standard DPO on spatial qualitative and quantitative tasks, respectively. SpatialReasoner-R1, trained with fDPO, sets a new state of the art on SpatialRGPT-Bench, outperforming the strongest baseline by 9.4% in average accuracy, while maintaining competitive performance on general vision-language tasks.
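To make the fDPO idea concrete, the following is a minimal sketch of a segment-weighted DPO-style objective. It is not the paper's implementation: the segment structure, field names, and the choice of a per-segment preference strength `beta` (e.g. one value for descriptive-grounding segments, another for logical-reasoning segments) are illustrative assumptions layered on the standard DPO loss.

```python
import math


def dpo_term(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """Standard DPO loss term: negative log-sigmoid of the scaled
    difference between the chosen (w) and rejected (l) policy/reference
    log-ratios."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


def fdpo_loss(segments):
    """Average segment-level DPO losses, each with its own beta.

    `segments`: list of dicts holding per-segment summed log-probs under
    the policy and reference model for the chosen/rejected response, plus
    a segment-specific preference strength `beta` (hypothetical field
    names, chosen for this sketch).
    """
    total = sum(
        dpo_term(s["logp_w"], s["logp_l"],
                 s["ref_logp_w"], s["ref_logp_l"], s["beta"])
        for s in segments
    )
    return total / len(segments)
```

When the policy matches the reference model, each term reduces to `log 2`; a segment where the policy already prefers the chosen response contributes a smaller loss, and raising that segment's `beta` sharpens the preference signal for that segment type.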