We present a novel approach to dense metric depth estimation that fuses a single-view image with a sparse, noisy Radar point cloud. Directly fusing heterogeneous Radar and image data, or their encodings, tends to yield dense depth maps with significant artifacts, blurred boundaries, and suboptimal accuracy. To circumvent this issue, we learn to augment versatile and robust monocular depth prediction with a dense metric scale induced from the sparse and noisy Radar data. We propose a Radar-Camera framework for highly accurate and fine-detailed dense depth estimation with four stages: monocular depth prediction, global scale alignment of monocular depth with sparse Radar points, quasi-dense scale estimation through learning the association between Radar points and image patches, and local scale refinement of dense depth using a scale map learner. Our method significantly outperforms state-of-the-art Radar-Camera depth estimation methods, reducing the mean absolute error (MAE) of depth estimation by 25.6% and 40.2% on the challenging nuScenes dataset and our self-collected ZJU-4DRadarCam dataset, respectively.
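As a rough illustration of the global scale alignment stage, the sketch below fits a single scale and shift that map a scale-ambiguous monocular depth map onto sparse Radar depths by closed-form least squares. The abstract does not say how the alignment is computed, so the least-squares formulation, the function names `align_global_scale` and `mean_absolute_error`, and the synthetic data are all assumptions made for illustration rather than the authors' implementation.

```python
import numpy as np


def align_global_scale(mono_depth, radar_uv, radar_depth):
    """Fit s, t so that s * mono_depth + t matches the Radar depths.

    Hypothetical helper: an ordinary least-squares fit over the pixels
    hit by projected Radar points (an assumed formulation).
    mono_depth : (H, W) scale-ambiguous monocular depth
    radar_uv   : (N, 2) integer pixel coordinates (u = column, v = row)
    radar_depth: (N,) metric depths of the projected Radar points
    """
    u = radar_uv[:, 0].astype(int)
    v = radar_uv[:, 1].astype(int)
    d_mono = mono_depth[v, u]  # monocular depth sampled at Radar pixels
    # Design matrix for the linear model d_radar ~ s * d_mono + t.
    A = np.stack([d_mono, np.ones_like(d_mono)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, radar_depth, rcond=None)
    return s * mono_depth + t, s, t


def mean_absolute_error(pred, gt):
    """MAE over pixels with valid (positive) ground-truth depth."""
    mask = gt > 0
    return float(np.mean(np.abs(pred[mask] - gt[mask])))


# Synthetic example: the metric depth is an affine function of the
# monocular prediction, and the Radar depths are noisy samples of it.
rng = np.random.default_rng(0)
mono = rng.uniform(1.0, 10.0, size=(60, 80))
metric = 2.0 * mono + 0.5
uv = np.stack([rng.integers(0, 80, 50), rng.integers(0, 60, 50)], axis=1)
radar = metric[uv[:, 1], uv[:, 0]] + rng.normal(0.0, 0.3, size=50)

aligned, s, t = align_global_scale(mono, uv, radar)
print("scale:", s, "shift:", t)
print("MAE before:", mean_absolute_error(mono, metric))
print("MAE after: ", mean_absolute_error(aligned, metric))
```

A single global scale and shift cannot correct spatially varying scale errors, which is presumably why the framework follows this stage with quasi-dense scale estimation and local scale refinement.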