Depth estimation has been widely studied and is a fundamental step of 3D perception for autonomous driving. Although monocular depth estimation has progressed significantly over the past decades, most of this work is evaluated on the KITTI benchmark, which provides only front-view cameras, and therefore ignores the correlations across surround-view cameras. In this paper, we propose S3Depth, a Simple Baseline for Supervised Surround-view Depth Estimation, which jointly predicts depth maps across multiple surrounding cameras. Specifically, we employ a global-to-local feature extraction module that combines CNN and transformer layers for enriched representations. Further, we propose an Adjacent-view Attention mechanism to enable both intra-view and inter-view feature propagation. The former is achieved by a self-attention module within each view, while the latter is realized by an adjacent attention module, which computes attention across multiple cameras to exchange multi-scale representations between surround-view feature maps. Extensive experiments show that our method outperforms existing state-of-the-art methods on both the DDAD and nuScenes datasets.
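To make the intra-view and inter-view propagation described in the abstract more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that the cameras form a ring, so each view has exactly two neighbours, that attention is single-head with no learned projections, and that only one feature scale is processed. The function names `attention` and `adjacent_view_attention`, the tensor layout `(views, tokens, channels)`, and the residual sum are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Scaled dot-product attention.
    # q: (Nq, C) queries; k, v: (Nk, C) keys and values.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def adjacent_view_attention(feats):
    # feats: (V, N, C) = (camera views, tokens per view, channels).
    # Assumption: views are ordered around the vehicle, so the
    # neighbours of view i are views i-1 and i+1 (indices wrap around).
    num_views = feats.shape[0]

    # Intra-view step: self-attention within each view.
    intra = np.stack([attention(f, f, f) for f in feats])

    # Inter-view step: each view queries the tokens of its two neighbours.
    out = np.empty_like(intra)
    for i in range(num_views):
        left = intra[(i - 1) % num_views]
        right = intra[(i + 1) % num_views]
        neighbours = np.concatenate([left, right], axis=0)  # (2N, C)
        # Residual sum keeps the view's own features (illustrative choice).
        out[i] = intra[i] + attention(intra[i], neighbours, neighbours)
    return out


# Usage: six surround-view cameras, 4 tokens each, 8 channels.
rng = np.random.default_rng(0)
features = rng.normal(size=(6, 4, 8))
fused = adjacent_view_attention(features)
print(fused.shape)
```

Restricting the keys and values to neighbouring views keeps the cost of the inter-view step linear in the number of cameras, whereas attending over all views at once would scale quadratically. In a multi-scale setting, the same function could be applied separately to the feature map at each scale.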