We formulate monocular depth estimation using denoising diffusion models, inspired by their recent successes in high-fidelity image generation. To that end, we introduce innovations to address problems arising from noisy, incomplete depth maps in the training data, including step-unrolled denoising diffusion, an $L_1$ loss, and depth infilling during training. To cope with the limited availability of data for supervised training, we leverage pre-training on self-supervised image-to-image translation tasks. Despite the simplicity of the approach, with a generic loss and architecture, our DepthGen model achieves state-of-the-art (SOTA) performance on the indoor NYU dataset and near-SOTA results on the outdoor KITTI dataset. Further, with a multimodal posterior, DepthGen naturally represents depth ambiguity (e.g., from transparent surfaces), and its zero-shot performance, combined with depth imputation, enables a simple but effective text-to-3D pipeline. Project page: https://depth-gen.github.io
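As a rough illustration of how an $L_1$ loss can be made to tolerate incomplete depth maps, the sketch below averages the absolute error only over pixels that carry a ground-truth depth value. It is a minimal NumPy example under stated assumptions, not the DepthGen implementation: the function name `masked_l1_loss`, the convention that missing depth is encoded as 0, and the toy arrays are all hypothetical, and the abstract does not specify how the loss interacts with depth infilling.

```python
import numpy as np


def masked_l1_loss(pred, target, valid_mask):
    """Mean absolute error over pixels that have a ground-truth depth value.

    Hypothetical helper for illustration; not taken from the paper.
    """
    valid = np.asarray(valid_mask, dtype=bool)
    n_valid = valid.sum()
    if n_valid == 0:
        # No supervised pixels in this map: contribute nothing to the loss.
        return 0.0
    abs_err = np.abs(np.asarray(pred) - np.asarray(target))
    return float(abs_err[valid].sum() / n_valid)


# Toy 2x2 depth map. Assumption: a value of 0 marks a missing measurement.
target = np.array([[1.0, 2.0],
                   [0.0, 4.0]])
valid_mask = target > 0
pred = np.array([[1.5, 2.0],
                 [3.0, 3.0]])

loss = masked_l1_loss(pred, target, valid_mask)
print(loss)
```

Compared with a squared-error loss, the absolute error grows only linearly with the residual, so isolated outliers in noisy depth maps pull less strongly on the average.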