Monocular depth estimation is scale-ambiguous, and thus requires scale supervision to produce metric predictions. Even so, the resulting models are geometry-specific, with learned scales that cannot be directly transferred across domains. For this reason, recent work focuses instead on relative depth, eschewing scale in favor of improved up-to-scale zero-shot transfer. In this work we introduce ZeroDepth, a novel monocular depth estimation framework capable of predicting metric scale for arbitrary test images from different domains and camera parameters. This is achieved by (i) the use of input-level geometric embeddings that enable the network to learn a scale prior over objects; and (ii) decoupling the encoder and decoder stages, via a variational latent representation that is conditioned on single-frame information. We evaluate ZeroDepth on both outdoor (KITTI, DDAD, nuScenes) and indoor (NYUv2) benchmarks, and achieve a new state of the art in both settings using the same pre-trained model, outperforming methods that train on in-domain data and require test-time scaling to produce metric estimates.
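To make the idea of an input-level geometric embedding concrete, the sketch below computes a per-pixel unit viewing ray from pinhole camera intrinsics. This is only an illustration of one common way to expose camera geometry to a network, and it is not claimed to be the exact embedding used by ZeroDepth. The function name `viewing_rays`, the intrinsics matrix `K`, and the integer pixel-coordinate convention are all assumptions made for this example.

```python
import numpy as np


def viewing_rays(K, height, width):
    """Return unit-norm viewing rays, shape (height, width, 3), for a pinhole camera.

    K is a 3x3 intrinsics matrix (hypothetical input for this sketch).
    Pixel (row v, column u) is assumed to sit at integer coordinates (u, v).
    """
    # Grid of pixel coordinates; with the default 'xy' indexing both arrays
    # have shape (height, width).
    u, v = np.meshgrid(np.arange(width), np.arange(height))

    # Homogeneous pixel coordinates [u, v, 1] for every pixel.
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)

    # Back-project through the inverse intrinsics: ray = K^{-1} [u, v, 1]^T.
    rays = pixels @ np.linalg.inv(K).T

    # Normalize so each ray encodes direction only.
    return rays / np.linalg.norm(rays, axis=-1, keepdims=True)


# Example intrinsics: focal length 100 px, principal point at (32, 24).
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 24.0],
              [0.0, 0.0, 1.0]])
rays = viewing_rays(K, height=48, width=64)

# The same image size with a different focal length yields different rays,
# which is what lets a network tell cameras apart at the input level.
K_zoom = K.copy()
K_zoom[0, 0] = K_zoom[1, 1] = 200.0
rays_zoom = viewing_rays(K_zoom, height=48, width=64)
```

Such rays could be concatenated with, or added to, image features so that the same pixel content is interpreted differently under different camera parameters. How the embedding is encoded and fused in practice is a design choice this sketch does not address.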