Multimodal Large Language Models (MLLMs) show strong visual perception, yet remain limited in spatial reasoning under changing viewpoints. We study this challenge as Perspective-Conditioned Spatial Reasoning (PCSR) in 360-degree omnidirectional images, where broad scene coverage reduces the ambiguity caused by partial observations without eliminating the need for viewpoint-dependent inference. To assess this capability, we introduce PCSR-Bench, a diagnostic benchmark of 84,373 question-answer pairs drawn from 2,600 omnidirectional images across 26 indoor environments. PCSR-Bench contains eight tasks spanning foundational perception (e.g., object counting, relative distance, and relative direction) and advanced PCSR, including compositional chains, egocentric rotation, perspective re-anchoring, ego-distortion, and limited-FOV visibility. We evaluate 14 representative MLLMs and observe a substantial perception-reasoning gap: accuracy reaches 57.59% on foundational relative direction, but drops to 13.49% on egocentric rotation, 7.13% on ego-distortion, and 0.64% on open-ended compositional reasoning. To probe the plasticity of this gap, we conduct an RL-based diagnostic study on a 7B-scale model. Under a controlled setting, reward shaping improves a matched 7B baseline from 31.10% to 60.06%, suggesting that the PCSR gap is partially plastic rather than fully immutable. Still, the gains are task-selective, sensitive to reward design (both weight allocation and reward formulation), and partially dependent on the evaluation protocol. These results position PCSR as a key bottleneck in current MLLMs and indicate limited but meaningful room for recovery under targeted optimization.