PanoWorld: Towards Spatial Supersensing in 360$^\circ$ Panorama World

Multimodal large language models (MLLMs) still struggle with spatial understanding under the dominant perspective-image paradigm, which inherits the narrow field of view of human-like perception. For navigation, robotic search, and 3D scene understanding, 360-degree panoramic sensing offers a form of supersensing by capturing the entire surrounding environment at once. However, existing MLLM pipelines typically decompose panoramas into multiple perspective views, leaving the spherical structure of equirectangular projection (ERP) largely implicit. In this paper, we study pano-native understanding, which requires an MLLM to reason over an ERP panorama as a continuous, observer-centered space. To this end, we first define the key abilities for pano-native understanding, including semantic anchoring, spherical localization, reference-frame transformation, and depth-aware 3D spatial reasoning. We then build a large-scale metadata construction pipeline that converts mixed-source ERP panoramas into geometry-aware, language-grounded, and depth-aware supervision, and instantiate these signals as capability-aligned instruction tuning data. On the model side, we introduce PanoWorld with Spherical Spatial Cross-Attention, which injects spherical geometry into the visual stream. We further construct PanoSpace-Bench, a diagnostic benchmark for evaluating ERP-native spatial reasoning. Experiments show that PanoWorld substantially outperforms both proprietary and open-source baselines on PanoSpace-Bench, H* Bench, and R2R-CE Val-Unseen benchmarks. These results demonstrate that robust panoramic reasoning requires dedicated pano-native supervision and geometry-aware model adaptation. All source code and proposed data will be publicly released.
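To make the "spherical structure of ERP" concrete, the following is a minimal sketch of the standard equirectangular pixel-to-sphere mapping. It is not taken from the PanoWorld code; the function names, the axis convention (y up, z forward), and the image size in the usage note are assumptions chosen for illustration. It shows why two pixels at opposite horizontal edges of an ERP image are neighbours on the sphere, a relation that stays implicit when a panorama is treated as a flat image.

```python
import math

def erp_pixel_to_sphere(u, v, width, height):
    """Map an ERP pixel (u, v) to (longitude, latitude) in radians.

    Assumed convention: u runs left to right over longitude [-pi, pi),
    v runs top to bottom over latitude [pi/2, -pi/2].
    """
    lon = (u / width) * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - (v / height) * math.pi
    return lon, lat

def sphere_to_unit_vector(lon, lat):
    """Convert (longitude, latitude) to a unit direction (x, y, z).

    Assumed axes: y points up, z points toward the image centre.
    """
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return x, y, z

def angular_distance(p, q, width, height):
    """Great-circle angle (radians) between two ERP pixels p and q."""
    a = sphere_to_unit_vector(*erp_pixel_to_sphere(p[0], p[1], width, height))
    b = sphere_to_unit_vector(*erp_pixel_to_sphere(q[0], q[1], width, height))
    dot = sum(ai * bi for ai, bi in zip(a, b))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
    return math.acos(dot)
```

For a hypothetical 2048x1024 panorama, `angular_distance((0, 512), (2047, 512), 2048, 1024)` is a small angle even though the two pixels are 2047 columns apart in the image, which is the wrap-around continuity that observer-centered reasoning over ERP has to respect.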

