Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Omnidirectional Images

Multimodal Large Language Models (MLLMs) show strong visual perception, yet remain limited in reasoning about space under changing viewpoints. We study this challenge as Perspective-Conditioned Spatial Reasoning (PCSR) in 360-degree omnidirectional images, where broad scene coverage reduces ambiguity from partial observations without eliminating the need for viewpoint-dependent inference. To assess this capability, we introduce PCSR-Bench, a diagnostic benchmark of 84,373 question-answer pairs from 2,600 omnidirectional images across 26 indoor environments. PCSR-Bench contains eight tasks spanning foundational perception (e.g., object counting, relative distance, and relative direction) and advanced PCSR, including compositional chains, egocentric rotation, perspective re-anchoring, ego-distortion, and limited-FOV visibility. We evaluate 14 representative MLLMs and observe a substantial perception-reasoning gap: accuracy reaches 57.59% on foundational relative direction, but drops to 13.49% on egocentric rotation, 7.13% on egocentric distortion, and 0.64% on open-ended compositional reasoning. To probe the plasticity of this gap, we conduct an RL-based diagnostic study on a 7B-scale model. Reward shaping improves a matched 7B baseline from 31.10% to 60.06% under a controlled setting, suggesting that PCSR is partially plastic rather than fully immutable. Still, the gains are task-selective, sensitive to reward design (both weight allocation and reward formulation), and partially dependent on the evaluation protocol. These results position PCSR as a key bottleneck in current MLLMs and highlight limited but meaningful room for recovery under targeted optimization.
