Depth estimation has been widely studied and is a fundamental step in 3D perception for intelligent vehicles. Although monocular depth estimation has progressed significantly over the past decades, most of this work has been evaluated on the KITTI benchmark with only front-view cameras, which ignores the correlations across surround-view cameras. In this paper, we propose S3Depth, a Simple Baseline for Supervised Surround-view Depth Estimation, which jointly predicts depth maps across multiple surrounding cameras. Specifically, we employ a global-to-local feature extraction module that combines CNN and transformer layers for enriched representations. Further, we propose an Adjacent-view Attention mechanism to enable intra-view and inter-view feature propagation. The former is achieved by a self-attention module within each view, while the latter is realized by an adjacent attention module, which computes attention across multiple cameras to exchange multi-scale representations across surround-view feature maps. Extensive experiments show that our method outperforms existing state-of-the-art methods on both the DDAD and nuScenes datasets.
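The intra-view and inter-view propagation described above can be illustrated with a minimal NumPy sketch. This is not the S3Depth implementation. It assumes the cameras are ordered in a ring around the vehicle, that each view has already been flattened into a set of feature tokens at a single scale, and that queries, keys, and values use identity projections with residual connections. The function names `attention` and `adjacent_view_attention` and all tensor sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention: q (Tq, C), k and v (Tk, C) -> (Tq, C)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def adjacent_view_attention(feats):
    """feats: (N, T, C) tokens for N cameras ordered around the vehicle (assumed)."""
    n = feats.shape[0]
    # Intra-view: self-attention inside each view, with a residual connection.
    intra = np.stack([f + attention(f, f, f) for f in feats])
    out = np.empty_like(intra)
    for i in range(n):
        # Inter-view: queries come from view i, keys and values from its
        # two ring neighbours (left and right adjacent cameras).
        neighbours = np.concatenate(
            [intra[(i - 1) % n], intra[(i + 1) % n]], axis=0
        )
        out[i] = intra[i] + attention(intra[i], neighbours, neighbours)
    return out

rng = np.random.default_rng(0)
feats = rng.standard_normal((6, 16, 32))  # 6 cameras, 16 tokens, 32 channels
fused = adjacent_view_attention(feats)
print(fused.shape)  # (6, 16, 32)
```

In this sketch each view exchanges information only with its two neighbours, so attention cost grows linearly with the number of cameras rather than quadratically. A multi-scale variant would apply the same function to each level of a feature pyramid.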