HiDiffusion: Unlocking High-Resolution Creativity and Efficiency in Low-Resolution Trained Diffusion Models

We introduce HiDiffusion, a tuning-free framework comprising Resolution-Aware U-Net (RAU-Net) and Modified Shifted Window Multi-head Self-Attention (MSW-MSA), which enables pretrained large text-to-image diffusion models to efficiently generate high-resolution images (e.g., 1024$\times$1024) that exceed the training image resolution. When generating images beyond the training resolution, pretrained diffusion models produce unreasonable object duplication. We attribute this to a mismatch between the feature map size of high-resolution images and the receptive field of the U-Net's convolutions. To address this issue, we propose a simple yet scalable method named RAU-Net, which dynamically adjusts the feature map size to match the convolutions' receptive field in the deep block of the U-Net. Another obstacle in high-resolution synthesis is the slow inference speed of the U-Net. We observe that the global self-attention in the top block exhibits locality yet consumes the majority of computational resources. To tackle this issue, we propose MSW-MSA. Unlike previous window attention mechanisms, our method uses a much larger window size and dynamically shifts windows to better accommodate diffusion models. Extensive experiments demonstrate that HiDiffusion can scale diffusion models to generate 1024$\times$1024, 2048$\times$2048, or even 4096$\times$4096 images while simultaneously reducing inference time by 40\%-60\%, achieving state-of-the-art performance on high-resolution image synthesis. The most significant finding of our work is that a diffusion model pretrained on low-resolution images can be scaled to high-resolution generation without further tuning. We hope this finding provides insights for future research on the scalability of diffusion models.
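To illustrate why replacing global self-attention with window attention can reduce cost, the following minimal sketch counts the query-key pairs scored on a token grid, and shows how a cyclic shift changes which window a token falls into. It is an illustration under stated assumptions, not the HiDiffusion implementation: the function names, the 128$\times$128 token grid, the 64$\times$64 window, and the (32, 32) shift are all hypothetical values chosen for the example.

```python
def attention_pairs(height, width, window=None):
    """Count query-key pairs scored by self-attention on a height x width token grid.

    window=None models global attention; window=(wh, ww) models attention
    restricted to non-overlapping wh x ww windows (hypothetical sizes).
    """
    tokens = height * width
    if window is None:
        return tokens * tokens
    wh, ww = window
    if height % wh or width % ww:
        raise ValueError("window must tile the feature map exactly")
    num_windows = (height // wh) * (width // ww)
    tokens_per_window = wh * ww
    return num_windows * tokens_per_window ** 2


def shifted_window_index(row, col, height, width, window, shift):
    """Return the window id of token (row, col) after cyclically shifting the grid.

    Using a different shift at different steps lets tokens near a window
    border share a window with different neighbours over time.
    """
    wh, ww = window
    dr, dc = shift
    r = (row + dr) % height
    c = (col + dc) % width
    return (r // wh) * (width // ww) + (c // ww)


# Assumed example: a 128 x 128 token grid with 64 x 64 windows.
global_cost = attention_pairs(128, 128)
window_cost = attention_pairs(128, 128, window=(64, 64))
print(global_cost // window_cost)  # how many times fewer pairs are scored
```

In this assumed configuration, four windows of 4096 tokens each replace one global attention over 16384 tokens, so the number of scored pairs drops by a factor of four. The count covers only the attention-score computation and is not a prediction of end-to-end inference time.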
