This paper explores leveraging the language priors learned by text-to-image diffusion models to address ambiguity and visual nuisance in monocular depth estimation. In particular, traditional monocular depth estimation suffers from inherent ambiguity due to the absence of stereo or multi-view depth cues, and from nuisance factors arising from limited visual robustness. We argue that the language prior in diffusion models can enhance monocular depth estimation by exploiting the geometric prior aligned with language descriptions, which is learned during text-to-image pre-training: to generate images that faithfully reflect a text prompt, the model must comprehend the size and shape of the specified objects, their spatial relationships, and the scale of the scene. We therefore propose PriorDiffusion, which uses a pre-trained text-to-image diffusion model that takes both an image and a text description aligned with the scene to infer affine-invariant depth through a denoising process. We also show that language priors can guide the model's attention to specific regions and help it perceive the 3D scene in alignment with user intent. Simultaneously, the language prior acts as a constraint that accelerates convergence of the diffusion trajectory, since learning 3D properties from a condensed, low-dimensional language feature is more efficient than learning from redundant, high-dimensional image features. Trained on HyperSim and Virtual KITTI, PriorDiffusion achieves state-of-the-art zero-shot performance and faster convergence than other diffusion-based depth estimators across NYUv2, KITTI, ETH3D, and ScanNet.
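To make the inference procedure concrete, the sketch below illustrates the general shape of a conditional denoising loop followed by affine-invariant normalization. This is not the paper's implementation: the `denoise_depth` function, the toy linear "denoiser" standing in for the pre-trained text-to-image U-Net, and the specific blend weights are all hypothetical placeholders chosen only to show the data flow (noise latent, image and text conditioning, iterative refinement, scale/shift normalization).

```python
import numpy as np

def denoise_depth(image_feat, text_feat, steps=10, seed=0):
    """Toy sketch of conditional depth denoising (illustrative only).

    image_feat: 2-D array standing in for image features/latents.
    text_feat:  1-D array standing in for a text embedding.
    Returns an affine-invariant depth map (zero median, unit mean
    absolute deviation), mirroring the normalization commonly used
    for affine-invariant depth targets.
    """
    rng = np.random.default_rng(seed)
    # Start from pure Gaussian noise, as in diffusion-based inference.
    depth = rng.standard_normal(image_feat.shape)
    for _ in range(steps):
        # Hypothetical noise prediction: a fixed linear blend that
        # stands in for the pre-trained U-Net conditioned on both
        # image features and the (condensed) text feature.
        eps_hat = (0.7 * depth
                   - 0.2 * image_feat
                   - 0.1 * float(np.mean(text_feat)))
        # One denoising step toward the conditional estimate.
        depth = depth - (1.0 / steps) * eps_hat
    # Affine-invariant output: remove shift (median) and scale.
    d = depth - np.median(depth)
    return d / (np.mean(np.abs(d)) + 1e-8)
```

The normalization at the end is what makes the prediction affine-invariant: any ground-truth depth related to the output by a scale and shift is matched after the same alignment, which is how zero-shot evaluation across datasets with different depth ranges becomes possible.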