In this paper, we argue that iterative computation with diffusion models offers a powerful paradigm not only for generation but also for visual perception tasks. We unify tasks such as depth estimation, optical flow, and segmentation under image-to-image translation, and show how diffusion models benefit from scaling training and test-time compute on these perception tasks. Through a careful analysis of these scaling behaviors, we present techniques that efficiently train diffusion models for visual perception. Our models achieve performance that is improved or comparable to state-of-the-art methods while using significantly less data and compute. Code and models are available at https://scaling-diffusion-perception.github.io.