Estimating depth from equirectangular (i.e., 360°) images (EIs) is challenging because of the distorted 180° × 360° field of view, which is hard for convolutional neural networks (CNNs) to handle. Although transformers with global attention achieve significant improvements over CNNs on the EI depth estimation task, they are computationally inefficient, which motivates transformers with local attention. However, applying local attention successfully to EIs requires a strategy that simultaneously addresses the distorted equirectangular geometry and the limited receptive field. Prior works have addressed only one of the two, occasionally resulting in unsatisfactory depth estimates. In this paper, we propose an equirectangular geometry-biased transformer termed EGformer. While limiting the computational cost and the number of network parameters, EGformer enables the extraction of equirectangular geometry-aware local attention with a large receptive field. To achieve this, we actively utilize the equirectangular geometry as the bias for the local attention instead of struggling to reduce the distortion of EIs. Compared with the most recent EI depth estimation studies, the proposed approach yields the best depth outcomes overall with the lowest computational cost and the fewest parameters, demonstrating the effectiveness of the proposed method.
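To make the idea of a geometry-derived attention bias concrete, the following is a minimal NumPy sketch, not EGformer's actual formulation, which the abstract does not specify. It assumes attention is computed within a single image row and that the bias is the negative great-circle distance between pixel centres; the function names, the `scale` parameter, and the row-wise window are all hypothetical choices made for illustration.

```python
import numpy as np

def pixel_angles(height, width):
    """Latitude and longitude (radians) of pixel centres in an equirectangular image."""
    rows = (np.arange(height) + 0.5) / height
    cols = (np.arange(width) + 0.5) / width
    lat = (0.5 - rows) * np.pi          # about +pi/2 at the top, -pi/2 at the bottom
    lon = (cols - 0.5) * 2.0 * np.pi    # about -pi at the left, +pi at the right
    return lat, lon

def great_circle_distance(lat1, lon1, lat2, lon2):
    """Angular distance (radians) between points on the unit sphere."""
    cos_d = (np.sin(lat1) * np.sin(lat2)
             + np.cos(lat1) * np.cos(lat2) * np.cos(lon1 - lon2))
    return np.arccos(np.clip(cos_d, -1.0, 1.0))

def biased_row_attention(q, k, v, row_lat, lon, scale=1.0):
    """Attention within one image row with an additive spherical-distance bias.

    q, k, v: arrays of shape (width, dim) for a single row at latitude row_lat.
    The bias form (negative distance times `scale`) is an assumption of this sketch.
    """
    dim = q.shape[-1]
    logits = q @ k.T / np.sqrt(dim)
    dist = great_circle_distance(row_lat, lon[:, None], row_lat, lon[None, :])
    logits = logits - scale * dist                   # pixels far apart on the sphere get less weight
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

Because the distance is measured on the sphere rather than in pixels, the same horizontal pixel offset yields a smaller penalty near the poles than at the equator, and the left and right image borders are treated as neighbours. This is one simple way to use the distortion as a signal rather than trying to remove it.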