By training on large-scale datasets, zero-shot monocular depth estimation (MDE) methods show robust performance in the wild but often suffer from insufficient detail. Although recent diffusion-based MDE approaches exhibit a superior ability to extract details, they struggle in geometrically complex scenes that challenge their geometry priors, which are trained on less diverse 3D data. To leverage the complementary merits of both worlds, we propose BetterDepth, which achieves geometrically correct affine-invariant MDE while capturing fine details. Specifically, BetterDepth is a conditional diffusion-based refiner that takes the prediction of a pre-trained MDE model as depth conditioning, in which the global depth layout is already well-captured, and iteratively refines details based on the input image. To train such a refiner, we propose global pre-alignment and local patch masking methods that ensure BetterDepth remains faithful to the depth conditioning while learning to add fine-grained scene details. With efficient training on small-scale synthetic datasets, BetterDepth achieves state-of-the-art zero-shot MDE performance on diverse public datasets and in in-the-wild scenes. Moreover, BetterDepth can improve the performance of other MDE models in a plug-and-play manner without re-training.
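The two training strategies named above can be illustrated with a minimal sketch. This is a hypothetical interpretation, not the paper's actual implementation: `global_pre_align` fits an affine (scale and shift) map from the conditioning depth to ground truth, the standard alignment used in affine-invariant MDE evaluation, and `local_patch_mask` marks patches where the aligned conditioning deviates strongly from ground truth; the patch size and tolerance values are assumptions.

```python
import numpy as np

def global_pre_align(cond_depth, gt_depth):
    """Least-squares affine (scale + shift) alignment of the conditioning
    depth to ground truth. Hypothetical sketch of global pre-alignment;
    BetterDepth's exact procedure may differ."""
    c = cond_depth.ravel()
    g = gt_depth.ravel()
    A = np.stack([c, np.ones_like(c)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    return s * cond_depth + t

def local_patch_mask(cond_depth, gt_depth, patch=16, tol=0.05):
    """Boolean mask that is True on patches where the (aligned) conditioning
    depth agrees with ground truth within `tol` mean absolute error, so the
    refiner is supervised to stay faithful there. `patch` and `tol` are
    illustrative assumptions."""
    H, W = cond_depth.shape
    mask = np.ones((H, W), dtype=bool)
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            c = cond_depth[i:i + patch, j:j + patch]
            g = gt_depth[i:i + patch, j:j + patch]
            if np.mean(np.abs(c - g)) > tol:
                mask[i:i + patch, j:j + patch] = False
    return mask
```

For example, a conditioning depth that is an affine distortion of ground truth is mapped back exactly by the pre-alignment step, after which every patch passes the mask.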