We propose a simple, efficient, yet powerful framework for dense visual prediction based on the conditional diffusion pipeline. Our approach follows a "noise-to-map" generative paradigm: it produces a prediction by progressively removing noise from a random Gaussian distribution, guided by the image. The method, called DDP, efficiently extends the denoising diffusion process into the modern perception pipeline. Without task-specific design or architecture customization, DDP generalizes easily to most dense prediction tasks, e.g., semantic segmentation and depth estimation. In addition, DDP shows attractive properties such as dynamic inference and uncertainty awareness, in contrast to previous single-step discriminative methods. We report results on three representative tasks across six diverse benchmarks. Without tricks, DDP achieves state-of-the-art or competitive performance on each task compared to the specialist counterparts: semantic segmentation (83.9 mIoU on Cityscapes), BEV map segmentation (70.6 mIoU on nuScenes), and depth estimation (0.05 REL on KITTI). We hope that our approach will serve as a solid baseline and facilitate future research.
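The "noise-to-map" idea can be illustrated with a minimal sketch. This is not DDP's actual implementation: the abstract does not specify the sampler, noise schedule, or network, so the cosine schedule, the deterministic DDIM-style update, and the names `noise_to_map`, `toy_denoiser`, and `cosine_alpha_bar` are all assumptions made for illustration. The sketch only shows the overall shape of the procedure: start from Gaussian noise and refine it over a chosen number of steps, with each step conditioned on image features.

```python
import numpy as np


def cosine_alpha_bar(t):
    """Cumulative signal level for a time t in [0, 1] (assumed cosine schedule)."""
    return np.cos(0.5 * np.pi * t) ** 2


def toy_denoiser(noisy_map, image_feats, t):
    """Hypothetical stand-in for a learned, image-conditioned map decoder.

    A real model would use noisy_map and t; this toy version ignores both
    and predicts the clean map directly from the image features.
    """
    return np.tanh(image_feats)


def noise_to_map(image_feats, denoiser, steps=3, seed=0):
    """Refine a Gaussian noise map into a prediction over `steps` iterations."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(image_feats.shape)  # start from pure Gaussian noise
    times = np.linspace(1.0, 0.0, steps + 1)    # t = 1 (all noise) down to t = 0
    for t_now, t_next in zip(times[:-1], times[1:]):
        a_now = cosine_alpha_bar(t_now)
        a_next = cosine_alpha_bar(t_next)
        # Image-guided estimate of the clean map at the current noise level.
        x0_pred = denoiser(x, image_feats, t_now)
        # Noise implied by the current sample and the clean-map estimate.
        eps = (x - np.sqrt(a_now) * x0_pred) / np.sqrt(max(1.0 - a_now, 1e-8))
        # Deterministic DDIM-style move to the next, lower noise level.
        x = np.sqrt(a_next) * x0_pred + np.sqrt(1.0 - a_next) * eps
    return x


if __name__ == "__main__":
    feats = np.linspace(-1.0, 1.0, 12).reshape(3, 4)  # stand-in image features
    pred = noise_to_map(feats, toy_denoiser, steps=3)
    print(pred.shape)
```

Because `steps` is an argument of the sampling loop rather than a property of the model, the same denoiser can be run with more or fewer iterations at inference time, which is one way to read the "dynamic inference" property mentioned in the abstract.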