Multimodal large language models (MLLMs) still struggle with spatial understanding under the dominant perspective-image paradigm, which inherits the narrow field of view of human-like perception. For navigation, robotic search, and 3D scene understanding, 360-degree panoramic sensing offers a form of supersensing by capturing the entire surrounding environment at once. However, existing MLLM pipelines typically decompose panoramas into multiple perspective views, leaving the spherical structure of the equirectangular projection (ERP) largely implicit. In this paper, we study pano-native understanding, which requires an MLLM to reason over an ERP panorama as a continuous, observer-centered space. To this end, we first define the key abilities for pano-native understanding: semantic anchoring, spherical localization, reference-frame transformation, and depth-aware 3D spatial reasoning. We then build a large-scale metadata construction pipeline that converts mixed-source ERP panoramas into geometry-aware, language-grounded, and depth-aware supervision, and instantiate these signals as capability-aligned instruction-tuning data. On the model side, we introduce PanoWorld with Spherical Spatial Cross-Attention, which injects spherical geometry into the visual stream. We further construct PanoSpace-Bench, a diagnostic benchmark for evaluating ERP-native spatial reasoning. Experiments show that PanoWorld substantially outperforms both proprietary and open-source baselines on PanoSpace-Bench, H* Bench, and R2R-CE Val-Unseen. These results indicate that robust panoramic reasoning requires dedicated pano-native supervision and geometry-aware model adaptation. All source code and the proposed data will be publicly released.
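To make the spherical structure of ERP concrete, the following is a minimal sketch of the standard mapping between ERP pixels and observer-centered viewing directions. It is not taken from PanoWorld: the function names, the axis convention (x forward, y to the side, z up), and the choice to place longitude 0 at the image center are all assumptions made for illustration.

```python
import math

def erp_pixel_to_direction(u, v, width, height):
    """Map an ERP pixel (u, v) to (longitude, latitude, unit direction).

    Assumed convention: longitude 0 at the image center, latitude +pi/2 at
    the top row, x forward, y sideways, z up.
    """
    lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi   # in [-pi, pi)
    lat = (0.5 - (v + 0.5) / height) * math.pi        # in (-pi/2, pi/2)
    x = math.cos(lat) * math.cos(lon)
    y = math.cos(lat) * math.sin(lon)
    z = math.sin(lat)
    return lon, lat, (x, y, z)

def direction_to_erp_pixel(direction, width, height):
    """Inverse mapping: a 3D direction back to continuous ERP pixel coordinates."""
    x, y, z = direction
    norm = math.sqrt(x * x + y * y + z * z)
    lon = math.atan2(y, x)
    lat = math.asin(max(-1.0, min(1.0, z / norm)))
    u = (lon / (2.0 * math.pi) + 0.5) * width - 0.5
    v = (0.5 - lat / math.pi) * height - 0.5
    return u, v

def angular_distance(pixel_a, pixel_b, width, height):
    """Angle in radians between the viewing directions of two ERP pixels."""
    _, _, a = erp_pixel_to_direction(pixel_a[0], pixel_a[1], width, height)
    _, _, b = erp_pixel_to_direction(pixel_b[0], pixel_b[1], width, height)
    dot = sum(p * q for p, q in zip(a, b))
    return math.acos(max(-1.0, min(1.0, dot)))
```

Under this convention, two pixels at opposite horizontal edges of the image are almost adjacent on the sphere, even though they are a full image width apart in pixel space. This seam continuity is one example of the structure that stays implicit when a panorama is treated as an ordinary flat image.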