In panorama understanding, the widely used equirectangular projection (ERP) introduces boundary discontinuity and spatial distortion, which severely degrade the performance of conventional CNNs and vision Transformers on panoramas. In this paper, we propose PanoSwin, a simple yet effective architecture for learning panorama representations with ERP. To address these two challenges, we explore a pano-style shift windowing scheme for the boundary discontinuity and a novel pitch attention for the spatial distortion. In addition, we adapt absolute positional embeddings and relative positional biases to panoramas, based on spherical distance and Cartesian coordinates, to enhance panoramic geometry information. Observing that planar image understanding may share common knowledge with panorama understanding, we devise a novel two-stage learning framework to facilitate knowledge transfer from planar images to panoramas. We compare PanoSwin against state-of-the-art methods on several panoramic tasks, namely panoramic object detection, panoramic classification, and panoramic layout estimation. The experimental results demonstrate the effectiveness of PanoSwin in panorama understanding.
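To make the two ERP issues concrete, the following minimal sketch maps ERP pixel centres to Cartesian points on the unit sphere and measures the spherical (great-circle) distance between them. It is an illustration only, not the PanoSwin implementation: the function names `erp_to_sphere` and `spherical_distance`, the axis convention, and the 512x256 image size are all assumptions made for this example.

```python
import math


def erp_to_sphere(u, v, width, height):
    """Map the centre of ERP pixel (u, v) to a point (x, y, z) on the unit sphere.

    Assumed convention (hypothetical, chosen for this sketch): u spans
    longitude [-pi, pi) left to right, v spans latitude from +pi/2 (top)
    to -pi/2 (bottom).
    """
    lon = ((u + 0.5) / width) * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - ((v + 0.5) / height) * math.pi
    x = math.cos(lat) * math.cos(lon)
    y = math.cos(lat) * math.sin(lon)
    z = math.sin(lat)
    return (x, y, z)


def spherical_distance(p, q):
    """Great-circle distance in radians between two unit vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    return math.acos(max(-1.0, min(1.0, dot)))


W, H = 512, 256  # assumed ERP resolution for the example

# Boundary discontinuity: the leftmost and rightmost pixels of a row are
# 511 pixels apart on the image plane but adjacent on the sphere.
left = erp_to_sphere(0, H // 2, W, H)
right = erp_to_sphere(W - 1, H // 2, W, H)
edge_gap = spherical_distance(left, right)

# Spatial distortion: horizontally adjacent pixels cover a much smaller
# arc near the pole (top row) than near the equator (middle row).
step_pole = spherical_distance(erp_to_sphere(0, 0, W, H), erp_to_sphere(1, 0, W, H))
step_equator = spherical_distance(
    erp_to_sphere(0, H // 2, W, H), erp_to_sphere(1, H // 2, W, H)
)
```

Under these assumptions, `edge_gap` is roughly one pixel step at the equator even though the two pixels sit at opposite image borders, and `step_pole` is far smaller than `step_equator`. Both effects are invisible to a positional encoding based on planar pixel offsets, which is the motivation for using spherical distance and Cartesian coordinates instead.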