The recent development of foundation models for monocular depth estimation, such as Depth Anything, has paved the way for zero-shot monocular depth estimation. Since such a model returns an affine-invariant disparity map, the favored technique to recover metric depth is to fine-tune the model. However, this stage is costly, both because of the training itself and because of the creation of the dataset, which must contain images captured by the camera that will be used at test time along with the corresponding ground truth. Moreover, fine-tuning may also degrade the generalization capacity of the original model. Instead, we propose in this paper a new method to rescale Depth Anything predictions using 3D points provided by low-cost sensors or techniques such as a low-resolution LiDAR, a stereo camera, or structure-from-motion with poses given by an IMU. This approach thus avoids fine-tuning and preserves the generalization power of the original depth estimation model, while remaining robust to the noise of the sensor or of the depth model. Our experiments highlight improvements relative to other metric depth estimation methods and competitive results compared to fine-tuned approaches. Code is available at https://gitlab.ensta.fr/ssh/monocular-depth-rescaling.
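To give a concrete sense of the kind of rescaling the abstract describes, the sketch below fits a global scale and shift that align the model's affine-invariant disparity with the inverse of the metric depths available at a sparse set of 3D points. This is a minimal least-squares illustration under our own assumptions, not the paper's exact algorithm (which may use a more robust fit); the function and variable names are ours.

```python
import numpy as np

def rescale_disparity(pred_disp, sparse_depth, mask):
    """Fit scale s and shift t so that s * pred_disp + t ~= 1 / metric_depth
    at the sparse points, then convert the whole map to metric depth.

    pred_disp    : (H, W) affine-invariant disparity from the depth model.
    sparse_depth : (H, W) metric depths, valid only where mask is True
                   (e.g. projected low-resolution LiDAR returns).
    mask         : (H, W) boolean validity mask for sparse_depth.
    """
    d = pred_disp[mask]                 # predicted disparities at sparse points
    g = 1.0 / sparse_depth[mask]       # target disparities (inverse depths)
    # Solve the linear least-squares problem [d, 1] @ [s, t]^T = g.
    A = np.stack([d, np.ones_like(d)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    # Dense metric depth; clip to keep the inversion numerically safe.
    metric_depth = 1.0 / np.clip(s * pred_disp + t, 1e-6, None)
    return metric_depth, s, t
```

With noise-free synthetic data the two parameters are recovered exactly; on real sensor data one would typically add outlier rejection before the fit.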