While methods for monocular depth estimation have made significant strides on standard benchmarks, zero-shot metric depth estimation remains unsolved. Challenges include the joint modeling of indoor and outdoor scenes, which often exhibit significantly different distributions of RGB and depth, and the depth-scale ambiguity caused by unknown camera intrinsics. Recent work has proposed specialized multi-head architectures for jointly modeling indoor and outdoor scenes. In contrast, we advocate a generic, task-agnostic diffusion model with several advancements, such as a log-scale depth parameterization to enable joint modeling of indoor and outdoor scenes, conditioning on the field of view (FOV) to handle scale ambiguity, and synthetic augmentation of the FOV during training to generalize beyond the limited camera intrinsics in training datasets. Furthermore, by employing a more diverse training mixture than is common and an efficient diffusion parameterization, our method, DMD (Diffusion for Metric Depth), achieves a 25% reduction in relative error (REL) on zero-shot indoor datasets and a 33% reduction on zero-shot outdoor datasets over the current SOTA, using only a small number of denoising steps. For an overview, see https://diffusion-vision.github.io/dmd
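To make the log-scale depth parameterization and the FOV conditioning signal concrete, the following is a minimal sketch, not the DMD implementation. The depth bounds `D_MIN` and `D_MAX`, the normalization to [-1, 1], and all function names are assumptions chosen for illustration; the abstract does not specify them. The FOV helper uses the standard pinhole-camera relation between image width and focal length.

```python
import math

# Hypothetical depth range in meters; the abstract does not state these bounds.
D_MIN = 0.5
D_MAX = 80.0


def encode_log_depth(depth_m, d_min=D_MIN, d_max=D_MAX):
    """Map metric depth (meters) to [-1, 1] on a log scale, clipping to the range."""
    clipped = min(max(depth_m, d_min), d_max)
    unit = math.log(clipped / d_min) / math.log(d_max / d_min)  # in [0, 1]
    return 2.0 * unit - 1.0


def decode_log_depth(value, d_min=D_MIN, d_max=D_MAX):
    """Invert encode_log_depth: map a value in [-1, 1] back to meters."""
    unit = (min(max(value, -1.0), 1.0) + 1.0) / 2.0
    return d_min * (d_max / d_min) ** unit


def horizontal_fov_deg(image_width_px, focal_length_px):
    """Horizontal field of view of a pinhole camera, in degrees."""
    return math.degrees(2.0 * math.atan(image_width_px / (2.0 * focal_length_px)))
```

With these assumed bounds, depths up to 10 m occupy roughly 59% of the encoded range, compared with roughly 12% under a linear mapping over the same interval, which illustrates why a log scale can allocate more representational capacity to indoor scenes while still covering outdoor ranges. Cropping an image to a narrower width at a fixed focal length reduces the value returned by `horizontal_fov_deg`, which is one simple way to picture synthetic FOV augmentation.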