This paper presents M$^2$Depth, a novel self-supervised two-frame multi-camera metric depth estimation network designed to predict reliable, scale-aware surrounding depth for autonomous driving. Unlike previous works, which use either multi-view images from a single time step or multi-time-step images from a single camera, M$^2$Depth takes two temporally adjacent frames from multiple cameras as input and produces high-quality surrounding depth. We first construct cost volumes in the spatial and temporal domains separately, then propose a spatial-temporal fusion module that integrates the information from both to yield a strong volume representation. We additionally combine the neural prior from SAM features with internal features to reduce the ambiguity between foreground and background and to sharpen depth edges. Extensive experiments on the nuScenes and DDAD benchmarks show that M$^2$Depth achieves state-of-the-art performance. More results can be found at https://heiheishuang.xyz/M2Depth .
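The idea of building spatial and temporal cost volumes separately and then fusing them can be illustrated for a single pixel with a minimal sketch. Everything here is an assumption made for illustration: the function names, the toy feature vectors, the dot-product matching score, and in particular the fixed convex combination in `fuse_costs`, which merely stands in for the spatial-temporal fusion module and is not the module used by M$^2$Depth. The features are assumed to be already warped to each depth hypothesis.

```python
import math

def correlation_cost(ref_feat, warped_feats):
    """Matching score per depth hypothesis: channel-averaged dot product
    between the reference feature and each pre-warped source feature."""
    channels = len(ref_feat)
    return [sum(r * w for r, w in zip(ref_feat, warped)) / channels
            for warped in warped_feats]

def fuse_costs(spatial, temporal, spatial_weight=0.5):
    """Hypothetical stand-in for a learned fusion module:
    a fixed convex combination of the two score volumes."""
    return [spatial_weight * s + (1.0 - spatial_weight) * t
            for s, t in zip(spatial, temporal)]

def expected_depth(scores, depth_bins):
    """Softmax over the depth hypotheses, then the probability-weighted depth."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return sum(e / total * d for e, d in zip(exps, depth_bins))

# Toy data for one pixel: four depth hypotheses, two feature channels.
depth_bins = [1.0, 2.0, 4.0, 8.0]
ref = [1.0, 0.0]
# Source features from an adjacent camera (spatial) and the previous frame
# (temporal), assumed to be already warped to each depth hypothesis.
spatial_warped = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
temporal_warped = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]

spatial = correlation_cost(ref, spatial_warped)
temporal = correlation_cost(ref, temporal_warped)
fused = fuse_costs(spatial, temporal)
depth = expected_depth(fused, depth_bins)
```

In this toy case both volumes score the second hypothesis highest, so the fused volume does as well; a learned fusion would instead weight the two sources per pixel, for example relying on the spatial volume only where adjacent cameras overlap.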