Recent foundation models for monocular depth estimation, such as Depth Anything, have paved the way for zero-shot monocular depth estimation. Because these models return an affine-invariant disparity map, the favored technique for recovering metric depth is to fine-tune the model. However, this stage is not straightforward: it can be costly and time-consuming because of both the training and the creation of the dataset, which must contain images captured by the camera that will be used at test time, together with the corresponding ground truth. Moreover, fine-tuning may degrade the generalization capacity of the original model. Instead, we propose in this paper a new method to rescale Depth Anything predictions using 3D points provided by sensors or techniques such as low-resolution LiDAR or structure-from-motion with poses given by an IMU. This approach avoids fine-tuning and preserves the generalization power of the original depth estimation model while remaining robust to noise in the sparse depth measurements and in the depth model. Our experiments show improvements over zero-shot monocular metric depth estimation methods, competitive results compared to fine-tuned approaches, and better robustness than depth completion approaches. Code is available at https://gitlab.ensta.fr/ssh/monocular-depth-rescaling.
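To make the rescaling idea concrete, here is a minimal sketch of the standard baseline it builds on: since the network outputs disparity only up to an affine transform, a scale and shift can be fitted in the least-squares sense between the predicted disparity and the inverse of the sparse metric depths (e.g. LiDAR returns projected into the image). The function name, the plain least-squares fit, and the clipping threshold are illustrative assumptions, not the paper's exact (noise-robust) formulation.

```python
import numpy as np

def rescale_affine_disparity(pred_disp, sparse_depth, mask):
    """Recover metric depth from an affine-invariant disparity map.

    pred_disp    : (H, W) disparity map from the depth model (up to scale/shift)
    sparse_depth : (H, W) metric depth values, valid only where mask is True
    mask         : (H, W) boolean mask of pixels with a sparse measurement

    Fits scale s and shift t such that s * pred_disp + t ≈ 1 / sparse_depth
    at the sparse points, then inverts the rescaled disparity everywhere.
    """
    d = pred_disp[mask]
    target = 1.0 / sparse_depth[mask]          # metric inverse depth at sparse points
    A = np.stack([d, np.ones_like(d)], axis=1)  # design matrix [disparity, 1]
    (s, t), *_ = np.linalg.lstsq(A, target, rcond=None)
    # clip to avoid division by zero on near-infinite-depth pixels
    metric_idepth = np.clip(s * pred_disp + t, 1e-6, None)
    return 1.0 / metric_idepth
```

In practice a plain least-squares fit is sensitive to outliers in the sparse depth, which is precisely the motivation for a more robust rescaling scheme; a RANSAC-style or weighted variant of the fit above would slot into the same interface.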