Diffusion models have become a mainstream approach for high-resolution image synthesis. However, directly generating higher-resolution images with pretrained diffusion models encounters unreasonable object duplication and exponentially increased generation time. In this paper, we discover that object duplication arises from feature duplication in the deep blocks of the U-Net. Concurrently, we pinpoint the extended generation time to self-attention redundancy in the U-Net's top blocks. To address these issues, we propose a tuning-free higher-resolution framework named HiDiffusion. Specifically, HiDiffusion contains Resolution-Aware U-Net (RAU-Net), which dynamically adjusts the feature map size to resolve object duplication, and Modified Shifted Window Multi-head Self-Attention (MSW-MSA), which utilizes optimized window attention to reduce computation. HiDiffusion can be integrated into various pretrained diffusion models to scale image generation resolution up to 4096x4096, at 1.5-6x the inference speed of previous methods. Extensive experiments demonstrate that our approach addresses both object duplication and heavy computation, achieving state-of-the-art performance on higher-resolution image synthesis tasks.
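To illustrate why restricting self-attention to local windows cuts computation, the sketch below compares the cost of global attention, which scales quadratically in the number of tokens, against window attention, which scales linearly. This is a minimal illustration of the general windowed-attention idea (as in Swin-style partitioning), not the paper's MSW-MSA implementation; the feature-map size, channel dimension, and window size are assumed for the example.

```python
import numpy as np

def global_attention_cost(h, w, d):
    # Full self-attention over all h*w tokens: O((h*w)^2 * d) multiply-adds.
    n = h * w
    return n * n * d

def window_attention_cost(h, w, d, win):
    # Attention restricted to non-overlapping win x win windows:
    # each token attends to win^2 tokens, so cost is O(h*w * win^2 * d).
    n = h * w
    return n * (win * win) * d

def window_partition(x, win):
    # Partition a (H, W, C) feature map into (num_windows, win*win, C)
    # token groups, the standard reshape/transpose used in window attention.
    h, w, c = x.shape
    x = x.reshape(h // win, win, w // win, win, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, win * win, c)

# Assumed example: a 128x128 feature map, channel dim 320, 8x8 windows.
full = global_attention_cost(128, 128, 320)
windowed = window_attention_cost(128, 128, 320, 8)
print(full // windowed)  # reduction factor = (128*128) / (8*8) = 256

windows = window_partition(np.zeros((128, 128, 320)), 8)
print(windows.shape)  # (256, 64, 320): 256 windows of 64 tokens each
```

The reduction factor grows with resolution, which is why windowed attention pays off most at the large feature maps produced during higher-resolution generation.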