While multimodal large language models (MLLMs) have made substantial progress in single-image spatial reasoning, multi-image spatial reasoning, which requires integration of information from multiple viewpoints, remains challenging. Cognitive studies suggest that humans address such tasks through two mechanisms: cross-view correspondence, which identifies regions across different views that correspond to the same physical locations, and stepwise viewpoint transformation, which composes relative viewpoint changes sequentially. However, existing studies incorporate these mechanisms only partially and often implicitly, without explicit supervision for both. We propose Human-Aware Training for Cross-view correspondence and viewpoint cHange (HATCH), a training framework with two complementary objectives: (1) Patch-Level Spatial Alignment, which encourages patch representations to align across views for spatially corresponding regions, and (2) Action-then-Answer Reasoning, which requires the model to generate explicit viewpoint transition actions before predicting the final answer. Experiments on three benchmarks demonstrate that HATCH consistently outperforms baselines of comparable size by a clear margin and achieves competitive results against much larger models, while preserving single-image reasoning capabilities.
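As a rough illustration of the two objectives, the sketch below shows one way they could be written down. It is not the HATCH implementation: the abstract does not specify the loss form, so the InfoNCE-style contrastive loss, the `temperature` value, the `pairs` list of corresponding patch indices, and the `Actions: ... Answer: ...` target format are all assumptions made for this example, as are the function names.

```python
import numpy as np


def patch_alignment_loss(feats_a, feats_b, pairs, temperature=0.1):
    """Contrastive loss that pulls corresponding patches together across views.

    Assumed form (InfoNCE-style), not taken from the paper.
    feats_a: (num_patches_a, dim) patch features from view A
    feats_b: (num_patches_b, dim) patch features from view B
    pairs:   list of (i, j) meaning patch i in A and patch j in B
             cover the same physical location (hypothetical supervision)
    """
    # L2-normalise so the dot product is cosine similarity.
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)

    idx_a = np.array([i for i, _ in pairs])
    idx_b = np.array([j for _, j in pairs])

    # Similarity of each anchored A-patch to every B-patch.
    logits = a[idx_a] @ b.T / temperature
    # Numerically stable log-softmax over the B-patches.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Negative log-likelihood of the true corresponding patch.
    return float(-log_probs[np.arange(len(pairs)), idx_b].mean())


def format_action_then_answer(actions, answer):
    """Build a training target that lists viewpoint actions before the answer.

    The textual format is a placeholder chosen for this sketch.
    """
    steps = " -> ".join(actions)
    return f"Actions: {steps}\nAnswer: {answer}"
```

With identity features for both views, the loss is close to zero when `pairs` matches each patch to its true counterpart and large when the pairs are swapped, which is the behaviour an alignment objective of this kind is meant to have.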