Multi-image spatial reasoning remains challenging for current multimodal large language models (MLLMs). While single-view perception is inherently 2D, reasoning over multiple views requires building a coherent scene understanding across viewpoints. In particular, we study perspective taking, where a model must build a coherent 3D understanding from multi-view observations and use it to reason from a new, language-specified viewpoint. We introduce CAMCUE, a pose-aware multi-image framework that uses camera pose as an explicit geometric anchor for cross-view fusion and novel-view reasoning. CAMCUE injects per-view pose into visual tokens, grounds natural-language viewpoint descriptions to a target camera pose, and synthesizes a pose-conditioned imagined target view to support answering. To support this setting, we curate CAMCUE-DATA with 27,668 training and 508 test instances pairing multi-view images and poses with diverse target-viewpoint descriptions and perspective-shift questions. The test split also includes human-annotated viewpoint descriptions to evaluate generalization to human language. CAMCUE improves overall accuracy by 9.06% and predicts target poses from natural-language viewpoint descriptions with over 90% accuracy, measured as rotation error within 20° and translation error within a 0.5 threshold. This direct grounding avoids expensive test-time search-and-match, reducing inference time from 256.6s to 1.45s per example and enabling fast, interactive use in real-world scenarios.