This paper explores leveraging the language priors learned by text-to-image diffusion models to address ambiguity and visual nuisance in monocular depth estimation. In particular, traditional monocular depth estimation suffers from inherent ambiguity due to the absence of stereo or multi-view depth cues, and from nuisance factors arising from limited visual robustness. We argue that the language prior in diffusion models can enhance monocular depth estimation by exploiting the geometric prior aligned with language descriptions, which is learned during text-to-image pre-training: to generate images that faithfully reflect a text prompt, the model must comprehend the size and shape of the specified objects, their spatial relationships, and the scale of the scene. We therefore propose PriorDiffusion, which uses a pre-trained text-to-image diffusion model that takes both an image and a text description aligned with the scene to infer affine-invariant depth through a denoising process. We also show that language priors can guide the model's attention to specific regions and help it perceive the 3D scene in alignment with user intent. Simultaneously, the language prior acts as a constraint that accelerates convergence of the diffusion trajectory, since learning 3D properties from a condensed, low-dimensional language feature is more efficient than learning from redundant, high-dimensional image features. Trained on HyperSim and Virtual KITTI, PriorDiffusion achieves state-of-the-art zero-shot performance and faster convergence than other diffusion-based depth estimators across NYUv2, KITTI, ETH3D, and ScanNet.
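To make the inference procedure concrete, the sketch below illustrates the general shape of a conditional denoising loop followed by affine-invariant normalization. This is not the paper's implementation: the `denoise_depth` function, the toy linear "denoiser" standing in for the pre-trained text-to-image U-Net, and the specific blend weights are all hypothetical placeholders chosen only to show the data flow (noise latent, image and text conditioning, iterative refinement, scale/shift normalization).

```python
import numpy as np

def denoise_depth(image_feat, text_feat, steps=10, seed=0):
    """Toy sketch of conditional depth denoising (illustrative only).

    image_feat: 2-D array standing in for image features/latents.
    text_feat:  1-D array standing in for a text embedding.
    Returns an affine-invariant depth map (zero median, unit mean
    absolute deviation), mirroring the normalization commonly used
    for affine-invariant depth targets.
    """
    rng = np.random.default_rng(seed)
    # Start from pure Gaussian noise, as in diffusion-based inference.
    depth = rng.standard_normal(image_feat.shape)
    for _ in range(steps):
        # Hypothetical noise prediction: a fixed linear blend that
        # stands in for the pre-trained U-Net conditioned on both
        # image features and the (condensed) text feature.
        eps_hat = (0.7 * depth
                   - 0.2 * image_feat
                   - 0.1 * float(np.mean(text_feat)))
        # One denoising step toward the conditional estimate.
        depth = depth - (1.0 / steps) * eps_hat
    # Affine-invariant output: remove shift (median) and scale.
    d = depth - np.median(depth)
    return d / (np.mean(np.abs(d)) + 1e-8)
```

The normalization at the end is what makes the prediction affine-invariant: any ground-truth depth related to the output by a scale and shift is matched after the same alignment, which is how zero-shot evaluation across datasets with different depth ranges becomes possible.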