Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Omnidirectional Images

Multimodal Large Language Models (MLLMs) show strong visual perception, yet remain limited in reasoning about space under changing viewpoints. We study this challenge as Perspective-Conditioned Spatial Reasoning (PCSR) in 360-degree omnidirectional images, where broad scene coverage reduces ambiguity from partial observations without eliminating the need for viewpoint-dependent inference. To assess this capability, we introduce PCSR-Bench, a diagnostic benchmark of 84,373 question-answer pairs from 2,600 omnidirectional images across 26 indoor environments. PCSR-Bench contains eight tasks spanning foundational perception (e.g., object counting, relative distance, and relative direction) and advanced PCSR, including compositional chains, egocentric rotation, perspective re-anchoring, egocentric distortion, and limited-FOV visibility. We evaluate 14 representative MLLMs and observe a substantial perception-reasoning gap: accuracy reaches 57.59% on foundational relative direction, but drops to 13.49% on egocentric rotation, 7.13% on egocentric distortion, and 0.64% on open-ended compositional reasoning. To probe the plasticity of this gap, we conduct an RL-based diagnostic study on a 7B-scale model. Under a controlled setting, reward shaping improves a matched 7B baseline from 31.10% to 60.06%, suggesting that PCSR is partially plastic rather than fully immutable. Still, the gains are task-selective, sensitive to reward design (both weight allocation and reward formulation), and partially dependent on the evaluation protocol. These results position PCSR as a key bottleneck in current MLLMs and highlight limited but meaningful room for recovery under targeted optimization.
