By training on large-scale datasets, zero-shot monocular depth estimation (MDE) methods show robust performance in the wild but often lack precise details. Although recent diffusion-based MDE approaches exhibit appealing detail-extraction ability, they still struggle in geometrically challenging scenes because robust geometric priors are hard to learn from diverse datasets. To combine the complementary merits of both worlds, we propose BetterDepth, which efficiently achieves geometrically correct, affine-invariant MDE while capturing fine-grained details. Specifically, BetterDepth is a conditional diffusion-based refiner that takes the prediction of a pre-trained MDE model as depth conditioning, in which the global depth context is already well captured, and iteratively refines details based on the input image. To train such a refiner, we propose global pre-alignment and local patch masking methods that keep BetterDepth faithful to the depth conditioning while it learns to capture fine-grained scene details. Through efficient training on small-scale synthetic datasets, BetterDepth achieves state-of-the-art zero-shot MDE performance on diverse public datasets and in-the-wild scenes. Moreover, BetterDepth can improve the performance of other MDE models in a plug-and-play manner without additional re-training.
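To make the "global pre-alignment" idea concrete, below is a minimal sketch of one standard way to align a relative depth prediction to ground truth before using it as conditioning: a closed-form least-squares fit of a scale and shift (affine alignment). This is an illustrative assumption; the function name `global_pre_align` is hypothetical and the paper's exact alignment procedure may differ.

```python
import numpy as np

def global_pre_align(pred, gt, mask=None):
    """Fit s, t minimizing ||s * pred + t - gt||^2 over valid pixels,
    then return the aligned prediction. Hypothetical helper sketching
    a least-squares scale/shift alignment, not the paper's exact code."""
    if mask is None:
        mask = np.isfinite(gt)  # ignore invalid ground-truth pixels
    p, g = pred[mask].ravel(), gt[mask].ravel()
    # Design matrix [pred, 1] for the affine fit s * pred + t.
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    return s * pred + t, (s, t)

# Toy check: ground truth is an exact affine transform of the prediction,
# so the fit should recover s = 2, t = 1.
pred = np.random.default_rng(0).uniform(0.1, 1.0, size=(4, 4))
gt = 2.0 * pred + 1.0
aligned, (s, t) = global_pre_align(pred, gt)
```

Aligning the conditioning depth globally in this way lets the refiner focus on local detail rather than on correcting the overall scale and offset of the pre-trained model's output.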