Monocular depth estimation is a fundamental computer vision task. Recovering 3D depth from a single image is geometrically ill-posed and requires scene understanding, so it is not surprising that the rise of deep learning has led to a breakthrough. The impressive progress of monocular depth estimators has mirrored the growth in model capacity, from relatively modest CNNs to large Transformer architectures. Still, monocular depth estimators tend to struggle when presented with images whose content and layout are unfamiliar, since their knowledge of the visual world is limited to the data seen during training, which makes zero-shot generalization to new domains difficult. This motivates us to explore whether the extensive priors captured in recent generative diffusion models can enable better, more generalizable depth estimation. We introduce Marigold, a method for affine-invariant monocular depth estimation that is derived from Stable Diffusion and retains its rich prior knowledge. The estimator can be fine-tuned in a couple of days on a single GPU using only synthetic training data. It delivers state-of-the-art performance across a wide range of datasets, including over 20% performance gains in specific cases. Project page: https://marigoldmonodepth.github.io.
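To make the term "affine-invariant" concrete: such a prediction is meaningful only up to an unknown global scale and shift, so comparing it with metric ground truth first requires fitting those two numbers. The sketch below shows one common way to do this, a least-squares fit in NumPy. It is an illustration under stated assumptions, not the Marigold implementation; the function names `align_scale_shift` and `abs_rel_error` are hypothetical.

```python
import numpy as np


def align_scale_shift(pred, target, mask=None):
    """Fit scale s and shift t minimizing ||s * pred + t - target||^2.

    Hypothetical helper for illustration; not taken from the Marigold code.
    `mask` optionally selects the pixels with valid ground-truth depth.
    """
    pred = np.asarray(pred, dtype=np.float64).ravel()
    target = np.asarray(target, dtype=np.float64).ravel()
    if mask is not None:
        valid = np.asarray(mask, dtype=bool).ravel()
        pred, target = pred[valid], target[valid]
    # Design matrix [pred, 1] turns the affine fit into linear least squares.
    design = np.stack([pred, np.ones_like(pred)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(design, target, rcond=None)
    return float(scale), float(shift)


def abs_rel_error(pred, target):
    """Mean absolute relative error between aligned prediction and target."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    return float(np.mean(np.abs(pred - target) / target))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic "ground truth" depth in metres, and a prediction that
    # differs from it only by an unknown affine transform.
    gt_depth = rng.uniform(1.0, 10.0, size=(4, 5))
    prediction = (gt_depth - 0.7) / 2.5
    s, t = align_scale_shift(prediction, gt_depth)
    aligned = s * prediction + t
    print("scale:", s, "shift:", t)
    print("AbsRel after alignment:", abs_rel_error(aligned, gt_depth))
```

Because the fit absorbs any global scale and shift, two predictions that differ only by an affine transform receive the same score after alignment.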