This paper presents enhancements to diffusion models by integrating a novel multi-resolution network and time-dependent layer normalization. Diffusion models have gained prominence for their effectiveness in high-fidelity image generation. While conventional approaches rely on convolutional U-Net architectures, recent Transformer-based designs have demonstrated superior performance and scalability. However, Transformer architectures, which tokenize input data (via "patchification"), face a trade-off between visual fidelity and computational complexity because self-attention scales quadratically with token length. While larger patch sizes make attention computation more efficient, they struggle to capture fine-grained visual details, leading to image distortions. To address this challenge, we propose augmenting the diffusion model with a multi-resolution network (DiMR), a framework that refines features across multiple resolutions, progressively enhancing detail from low to high resolution. Additionally, we introduce Time-Dependent Layer Normalization (TD-LN), a parameter-efficient approach that incorporates time-dependent parameters into layer normalization to inject time information and achieve superior performance. Our method's efficacy is demonstrated on the class-conditional ImageNet generation benchmark, where DiMR-XL variants outperform prior diffusion models, setting new state-of-the-art FID scores of 1.70 on ImageNet 256 x 256 and 2.89 on ImageNet 512 x 512. Project page: https://qihao067.github.io/projects/DiMR
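To make the TD-LN idea concrete, below is a minimal NumPy sketch of what "incorporating time-dependent parameters into layer normalization" could look like. The parameterization shown here (the scale and shift are affine functions of the normalized timestep `t`) is an illustrative assumption, not the paper's exact formulation; the class name `TDLayerNorm` and all hyperparameters are hypothetical.

```python
import numpy as np

class TDLayerNorm:
    """Sketch of time-dependent layer normalization (hypothetical details).

    Instead of a static scale/shift, gamma and beta are simple functions of
    the diffusion timestep t, so time conditioning adds only 2*dim extra
    parameters on top of a standard layer norm.
    """

    def __init__(self, dim, eps=1e-5, seed=0):
        rng = np.random.default_rng(seed)
        # base parameters, as in standard layer norm
        self.gamma0 = np.ones(dim)
        self.beta0 = np.zeros(dim)
        # time-dependent components (assumed parameterization)
        self.gamma1 = rng.normal(0.0, 0.02, dim)
        self.beta1 = rng.normal(0.0, 0.02, dim)
        self.eps = eps

    def __call__(self, x, t):
        # normalize over the feature dimension
        mu = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        xn = (x - mu) / np.sqrt(var + self.eps)
        # timestep t in [0, 1] modulates the affine parameters
        gamma = self.gamma0 + t * self.gamma1
        beta = self.beta0 + t * self.beta1
        return gamma * xn + beta

ln = TDLayerNorm(8)
x = np.random.default_rng(1).normal(size=(2, 8))
y = ln(x, t=0.5)
```

At `t = 0` this reduces to a plain layer norm with unit scale and zero shift; different timesteps produce different affine modulations, which is how time information reaches each block without a separate timestep-embedding MLP per layer.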