The generalization of monocular metric depth estimation (MMDE) has been a longstanding challenge. Recent methods have made progress by combining relative and metric depth or by aligning the focal length of input images. However, they still face challenges at the camera, scene, and data levels: (1) sensitivity to different cameras; (2) inconsistent accuracy across scenes; (3) reliance on massive training data. This paper proposes SM4Depth, a seamless MMDE method that addresses all of these issues within a single network. First, we reveal that a consistent field of view (FOV) is the key to resolving ``metric ambiguity'' across cameras, which guides us to propose a more straightforward preprocessing unit. Second, to achieve consistently high accuracy across scenes, we explicitly model metric scale determination as discretizing the depth interval into bins and propose variation-based unnormalized depth bins. This method bridges the depth gap between diverse scenes by reducing the ambiguity of conventional metric bins. Third, to reduce the reliance on massive training data, we propose a ``divide and conquer'' solution: instead of estimating directly from the vast solution space, the correct metric bins are estimated from multiple solution sub-spaces, which reduces complexity. Finally, with just 150K RGB-D pairs and a consumer-grade GPU for training, SM4Depth achieves state-of-the-art performance on most previously unseen datasets, in particular surpassing ZoeDepth and Metric3D on mRI$_\theta$. The code can be found at https://github.com/1hao-Liu/SM4Depth.
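To illustrate why a consistent FOV removes camera dependence, the following minimal sketch uses the standard pinhole relation between image width, focal length, and horizontal FOV. It is not the preprocessing unit of SM4Depth, whose details are not given here; the function names, the two example cameras, and the 50-degree target FOV are assumptions chosen for illustration only.

```python
import math


def horizontal_fov_deg(width_px: float, focal_px: float) -> float:
    """Horizontal FOV of a pinhole camera, in degrees."""
    return math.degrees(2.0 * math.atan(width_px / (2.0 * focal_px)))


def crop_width_for_fov(focal_px: float, target_fov_deg: float) -> float:
    """Width (in pixels) of a centered crop that spans target_fov_deg."""
    return 2.0 * focal_px * math.tan(math.radians(target_fov_deg) / 2.0)


# Hypothetical cameras: (image width in pixels, focal length in pixels).
CAMERAS = {"wide": (1280, 600.0), "narrow": (1280, 1100.0)}
TARGET_FOV_DEG = 50.0  # assumed common FOV, not a value from the paper

crops = {}
for name, (width, focal) in CAMERAS.items():
    native_fov = horizontal_fov_deg(width, focal)
    crop_w = crop_width_for_fov(focal, TARGET_FOV_DEG)
    crops[name] = crop_w
    print(f"{name}: native FOV {native_fov:.1f} deg, crop width {crop_w:.1f} px")
```

After cropping, both images cover the same angular extent, so resizing them to a common resolution gives them the same effective focal length in pixels. If the target FOV exceeded a camera's native FOV, the crop would be wider than the image and padding would be needed instead.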