Monocular depth estimation is a challenging task that predicts pixel-wise depth from a single 2D image. Current methods typically model this problem as a regression or classification task. We propose DiffusionDepth, a new approach that reformulates monocular depth estimation as a denoising diffusion process. It learns an iterative denoising process that "denoises" a random depth distribution into a depth map under the guidance of monocular visual conditions. The process is performed in a latent space encoded by a dedicated depth encoder and decoder. Instead of diffusing ground-truth (GT) depth, the model learns to reverse the process of diffusing its own refined depth into a random depth distribution. This self-diffusion formulation overcomes the difficulty of applying generative models when GT depth is sparse. The approach benefits the task by refining the depth estimate step by step, which helps generate accurate and highly detailed depth maps. Experimental results on the KITTI and NYU-Depth-V2 datasets suggest that a simple yet efficient diffusion approach can reach state-of-the-art performance in both indoor and outdoor scenarios with acceptable inference time.
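The iterative denoising described in the abstract can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration, not the authors' implementation: the linear noise schedule, the DDIM-style deterministic update, and the names `make_alpha_bars`, `diffuse`, `ddim_step`, `sample_depth_latent`, and `make_toy_denoiser` are all hypothetical. The toy denoiser is an oracle that treats the condition as the clean depth latent, standing in for a learned network guided by image features. The `diffuse` function shows the forward step that, under the self-diffusion idea, would be applied to the model's own refined depth latent rather than to GT depth.

```python
import numpy as np


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def diffuse(x0, t, alpha_bars, rng):
    """Forward process: add Gaussian noise to a clean latent x0 up to step t.

    In a self-diffusion setup, x0 would be the model's own refined depth
    latent instead of a (possibly sparse) ground-truth depth map.
    """
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps


def ddim_step(xt, eps_pred, t, t_prev, alpha_bars):
    """One deterministic reverse step (DDIM-style, no added noise)."""
    a_t = alpha_bars[t]
    a_prev = alpha_bars[t_prev] if t_prev >= 0 else 1.0
    # Estimate the clean latent from the current sample and predicted noise.
    x0_pred = (xt - np.sqrt(1.0 - a_t) * eps_pred) / np.sqrt(a_t)
    # Move to the noise level of the previous (less noisy) step.
    return np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps_pred


def sample_depth_latent(denoiser, cond, shape, alpha_bars, step_ids, rng):
    """Start from random noise and denoise it step by step under `cond`."""
    x = rng.standard_normal(shape)
    for i, t in enumerate(step_ids):
        t_prev = step_ids[i + 1] if i + 1 < len(step_ids) else -1
        eps_pred = denoiser(x, cond, t)
        x = ddim_step(x, eps_pred, t, t_prev, alpha_bars)
    return x


def make_toy_denoiser(alpha_bars):
    """Hypothetical oracle: treats `cond` as the clean latent itself.

    A real model would predict the noise from visual features instead.
    """
    def denoiser(xt, cond, t):
        a_t = alpha_bars[t]
        return (xt - np.sqrt(a_t) * cond) / np.sqrt(1.0 - a_t)
    return denoiser


rng = np.random.default_rng(0)
alpha_bars = make_alpha_bars(1000)
target_latent = rng.standard_normal((4, 8, 8))  # stand-in for a depth latent
step_ids = [int(t) for t in np.linspace(999, 0, 20)]  # 20 of 1000 steps
depth_latent = sample_depth_latent(
    make_toy_denoiser(alpha_bars), target_latent, target_latent.shape,
    alpha_bars, step_ids, rng,
)
```

Because the sampler visits only a subset of the training steps (20 of 1000 here), the number of network evaluations at inference stays small, which is one way an iterative method can keep inference time acceptable. In a full system the resulting latent would then be passed through the depth decoder to obtain the depth map.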