Panoramic distortion poses a significant challenge in 360° depth estimation, and is particularly pronounced near the north and south poles. Existing methods either adopt a bi-projection fusion strategy to remove distortions or model long-range dependencies to capture global structures, which can result in either unclear structure or insufficient local perception, respectively. In this paper, we propose a spherical geometry transformer, named SGFormer, to address these issues, taking an innovative step toward integrating spherical geometric priors into vision transformers. To this end, we retarget the transformer decoder as a spherical prior decoder (termed SPDecoder), which strives to uphold the integrity of spherical structures during decoding. Concretely, we leverage bipolar re-projection, circular rotation, and curve local embedding to preserve the spherical characteristics of equidistortion, continuity, and surface distance, respectively. Furthermore, we present a query-based global conditional position embedding to compensate for spatial structure at varying resolutions. It not only boosts global perception of spatial position but also sharpens the depth structure across different patches. Finally, we conduct extensive experiments on popular benchmarks, demonstrating our superiority over state-of-the-art solutions.
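To make the continuity property concrete: in an equirectangular projection (ERP), the left and right image edges are physically adjacent on the sphere, so a horizontal wrap-around shift corresponds to a lossless rotation of the underlying sphere. The sketch below is a minimal illustration of this idea only; the function name and `(C, H, W)` layout are assumptions for demonstration, not the paper's actual decoder implementation.

```python
import numpy as np

def circular_rotate(feat, shift):
    """Rotate an ERP feature map around the sphere's vertical axis.

    feat: array of shape (C, H, W), where the W axis spans
    longitude 0..2*pi. Because the ERP image wraps horizontally,
    a circular shift along W is an exact spherical rotation and
    loses no information at the seam.
    """
    return np.roll(feat, shift, axis=-1)

# Rotating forward and then back recovers the original map exactly,
# illustrating that the operation preserves spherical continuity.
feat = np.random.rand(8, 4, 16)
restored = circular_rotate(circular_rotate(feat, 5), -5)
assert np.allclose(feat, restored)
```

A perspective (non-panoramic) image has no such wrap-around, which is why this rotation trick is specific to spherical representations.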