Currently, applying diffusion models in the pixel space of high-resolution images is difficult. Instead, existing approaches either run diffusion in a lower-dimensional space (latent diffusion) or generate through multiple super-resolution stages, referred to as cascades. The downside is that these approaches add complexity to the diffusion framework. This paper aims to improve denoising diffusion for high-resolution images while keeping the model as simple as possible. The paper is centered around the research question: How can one train a standard denoising diffusion model on high-resolution images and still obtain performance comparable to these alternative approaches? The four main findings are: 1) the noise schedule should be adjusted for high-resolution images, 2) it is sufficient to scale only a particular part of the architecture, 3) dropout should be added at specific locations in the architecture, and 4) downsampling is an effective strategy to avoid high-resolution feature maps. Combining these simple yet effective techniques, we achieve state-of-the-art image generation on ImageNet among diffusion models without sampling modifiers.
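To make the first finding concrete, the following is a minimal sketch of one way a noise schedule could be adjusted for resolution. The abstract does not say how the adjustment is done, so everything here is an assumption for illustration: the sketch takes a cosine schedule expressed as a log signal-to-noise ratio and shifts it by `2 * log(base_resolution / resolution)`, so that larger images receive more noise at the same timestep. The function names and the reference resolution of 64 are hypothetical choices, not values taken from the text.

```python
import math


def cosine_logsnr(t: float) -> float:
    """Log-SNR of a cosine noise schedule at time t, with 0 < t < 1."""
    return -2.0 * math.log(math.tan(math.pi * t / 2.0))


def shifted_cosine_logsnr(t: float, resolution: int, base_resolution: int = 64) -> float:
    """Cosine log-SNR shifted for a target resolution.

    Assumption: the shift is 2 * log(base_resolution / resolution), which
    lowers the SNR (adds more noise) when resolution > base_resolution.
    """
    return cosine_logsnr(t) + 2.0 * math.log(base_resolution / resolution)


def alpha_sigma(logsnr: float) -> tuple[float, float]:
    """Variance-preserving signal and noise scales for a given log-SNR.

    alpha^2 = sigmoid(logsnr) and sigma^2 = sigmoid(-logsnr), so
    alpha^2 + sigma^2 = 1.
    """
    alpha_sq = 1.0 / (1.0 + math.exp(-logsnr))
    return math.sqrt(alpha_sq), math.sqrt(1.0 - alpha_sq)


if __name__ == "__main__":
    # Compare the signal scale at the midpoint of the schedule.
    for res in (64, 128, 256, 512):
        alpha, sigma = alpha_sigma(shifted_cosine_logsnr(0.5, res))
        print(f"resolution={res:4d}  alpha={alpha:.4f}  sigma={sigma:.4f}")
```

Under these assumptions, the schedule at the reference resolution is unchanged, while at higher resolutions the same timestep retains less signal. This compensates for the fact that neighbouring pixels in large images are highly correlated and so are harder to destroy with a fixed amount of noise.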