While multimodal large language models (MLLMs) have made substantial progress in single-image spatial reasoning, multi-image spatial reasoning, which requires integrating information from multiple viewpoints, remains challenging. Cognitive studies suggest that humans address such tasks through two mechanisms: cross-view correspondence, which identifies regions across different views that correspond to the same physical locations, and stepwise viewpoint transformation, which composes relative viewpoint changes sequentially. However, existing studies incorporate these mechanisms only partially and often implicitly, without explicit supervision for both. We propose Human-Aware Training for Cross-view correspondence and viewpoint cHange (HATCH), a training framework with two complementary objectives: (1) Patch-Level Spatial Alignment, which encourages patch representations of spatially corresponding regions to align across views, and (2) Action-then-Answer Reasoning, which requires the model to generate explicit viewpoint-transition actions before predicting the final answer. Experiments on three benchmarks demonstrate that HATCH consistently outperforms baselines of comparable size by a clear margin and achieves competitive results against much larger models, while preserving single-image reasoning capabilities.
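To make the first objective concrete, the sketch below illustrates one way a patch-level spatial alignment loss could be implemented, assuming each view is encoded into patch embeddings and a correspondence map pairs patches covering the same physical location. The function name, tensor shapes, and the cosine-similarity formulation are illustrative assumptions, not the exact loss used by HATCH.

```python
# Minimal sketch of a patch-level spatial alignment loss (illustrative only,
# not the paper's exact formulation).
import torch
import torch.nn.functional as F


def patch_alignment_loss(
    patches_a: torch.Tensor,   # (N_a, D) patch embeddings from view A
    patches_b: torch.Tensor,   # (N_b, D) patch embeddings from view B
    corr: torch.Tensor,        # (M, 2) index pairs (i in A, j in B) marking
                               # patches that cover the same physical location
) -> torch.Tensor:
    """Pull spatially corresponding patch representations together across views."""
    a = F.normalize(patches_a[corr[:, 0]], dim=-1)
    b = F.normalize(patches_b[corr[:, 1]], dim=-1)
    # 1 - cosine similarity, averaged over all corresponding patch pairs.
    return (1.0 - (a * b).sum(dim=-1)).mean()


# Toy usage: two 16x16 patch grids with 768-dim embeddings and 50 correspondences.
pa, pb = torch.randn(256, 768), torch.randn(256, 768)
pairs = torch.stack(
    [torch.randint(0, 256, (50,)), torch.randint(0, 256, (50,))], dim=1
)
loss = patch_alignment_loss(pa, pb, pairs)
```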