Diffusion models have shown remarkable performance in image generation in recent years. However, because memory usage grows quadratically when generating ultra-high-resolution images (e.g., 4096×4096), the resolution of generated images is often limited to 1024×1024. In this work, we propose a unidirectional block attention mechanism that adaptively adjusts the memory overhead during inference while still handling global dependencies. Building on this module, we adopt the DiT structure for upsampling and develop an infinite super-resolution model capable of upsampling images of various shapes and resolutions. Comprehensive experiments show that our model achieves state-of-the-art performance in generating ultra-high-resolution images in both machine and human evaluations. Compared to commonly used UNet structures, our model reduces memory usage by more than 5× when generating 4096×4096 images. The project URL is https://github.com/THUDM/Inf-DiT.
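To make the memory argument concrete, the following is a minimal back-of-the-envelope sketch, not the model's implementation. It assumes that "quadratic" refers to the image side length, i.e., that activation memory is proportional to the number of pixels. The function names and the block size of 128 are hypothetical and chosen only for illustration; the abstract does not specify them.

```python
def pixel_count(side: int) -> int:
    """Number of pixels in a square image with the given side length."""
    return side * side


def relative_memory(side: int, base_side: int = 1024) -> float:
    """Ratio of activation memory at `side` versus `base_side`.

    Assumption: activation memory is proportional to pixel count,
    so it grows quadratically with the side length.
    """
    return pixel_count(side) / pixel_count(base_side)


def blocks_per_image(side: int, block: int = 128) -> int:
    """Number of square blocks needed to tile a square image.

    The block size is a hypothetical value, not taken from the paper.
    Ceiling division covers sides that are not a multiple of `block`.
    """
    per_side = -(-side // block)
    return per_side * per_side


if __name__ == "__main__":
    for side in (1024, 2048, 4096):
        print(side, relative_memory(side), blocks_per_image(side))
```

Under these assumptions, a 4096×4096 image has 16 times as many pixels as a 1024×1024 image, which is why processing the whole image at once becomes expensive. A block-wise scheme only needs to keep a subset of the blocks in memory at any time, which is the general idea behind adjusting memory overhead during inference.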