In this work, we investigate generating images from pre-trained diffusion models at resolutions much higher than the training image size, and with arbitrary aspect ratios. When generating images directly at a higher resolution (1024 x 1024) with a pre-trained Stable Diffusion model trained on 512 x 512 images, we observe persistent problems of object repetition and unreasonable object structures. Existing approaches to higher-resolution generation, such as attention-based and joint-diffusion methods, do not adequately address these issues. Taking a new perspective, we examine the structural components of the U-Net in diffusion models and identify the limited perception field of the convolutional kernels as the crucial cause. Based on this key observation, we propose a simple yet effective re-dilation that dynamically adjusts the convolutional perception field during inference. We further propose dispersed convolution and noise-damped classifier-free guidance, which enable ultra-high-resolution image generation (e.g., 4096 x 4096). Notably, our approach requires no training or optimization. Extensive experiments demonstrate that our approach addresses the repetition issue well and achieves state-of-the-art performance on higher-resolution image synthesis, especially in texture details. Our work also suggests that a diffusion model pre-trained on low-resolution images can be used directly for high-resolution visual generation without further tuning, which may provide insights for future research on ultra-high-resolution image and video synthesis.
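As a rough illustration of why dilating a convolution enlarges its perception field without touching the learned weights, the following is a minimal pure-Python sketch. It is not the re-dilation procedure itself, whose details (such as how the dilation is adjusted dynamically during inference) are not given in this abstract. The helper names `effective_kernel_size` and `dilated_conv1d`, the 1-D setting, and the choice of dilation 2 for a 2x resolution increase are all assumptions made for the example.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Number of input positions spanned by one dilated kernel."""
    return dilation * (kernel_size - 1) + 1


def dilated_conv1d(signal, weights, dilation=1):
    """'Valid' 1-D cross-correlation; the weights are reused unchanged."""
    span = effective_kernel_size(len(weights), dilation)
    return [
        # Each tap reads the input `dilation` positions apart.
        sum(w * signal[start + i * dilation] for i, w in enumerate(weights))
        for start in range(len(signal) - span + 1)
    ]


# Hypothetical setting: a 3-tap kernel learned at the training resolution,
# applied at twice that resolution with dilation 2 (assumed scale factor).
weights = [1.0, 0.0, -1.0]
signal = [float(x * x) for x in range(8)]

print(effective_kernel_size(3, 1))  # span at the training resolution
print(effective_kernel_size(3, 2))  # span after dilating by 2
print(dilated_conv1d(signal, weights, dilation=2))
```

With dilation 2, the same three weights cover five input positions instead of three, so at doubled resolution the kernel sees roughly the same fraction of the image as it did during training. No parameter is modified, which is consistent with the abstract's claim that the approach requires no training or optimization.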