Ultra-high-resolution image generation is challenging: semantic planning becomes more complex, fine details are harder to synthesize, and training demands substantial resources. We present UltraPixel, a novel architecture that uses cascade diffusion models to generate high-quality images at multiple resolutions (\textit{e.g.}, 1K to 6K) within a single model while remaining computationally efficient. UltraPixel uses the semantics-rich representations of lower-resolution images in the later denoising stage to guide the entire generation of highly detailed high-resolution images, significantly reducing complexity. We further introduce implicit neural representations for continuous upsampling and scale-aware normalization layers that adapt to various resolutions. Notably, both the low- and high-resolution processes are performed in the most compact space and share the majority of their parameters, with less than 3$\%$ additional parameters needed for high-resolution outputs, which greatly improves training and inference efficiency. Our model trains quickly with reduced data requirements, produces photo-realistic high-resolution images, and demonstrates state-of-the-art performance in extensive experiments.
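To make the idea of continuous upsampling with an implicit neural representation concrete, the following is a minimal sketch and not the UltraPixel implementation. It assumes a coordinate-query design: each target pixel looks up its nearest low-resolution feature and passes it, together with its relative offset and the cell size, through a small MLP. The names `make_mlp` and `continuous_upsample` are hypothetical, and the MLP weights are random (untrained), so only the shapes and data flow are meaningful.

```python
import numpy as np

def make_mlp(in_dim, hidden, out_dim, seed=0):
    """Return a tiny two-layer ReLU MLP with random, untrained weights."""
    rng = np.random.default_rng(seed)
    w1 = rng.normal(0.0, 0.1, (in_dim, hidden))
    w2 = rng.normal(0.0, 0.1, (hidden, out_dim))

    def mlp(x):
        return np.maximum(x @ w1, 0.0) @ w2

    return mlp

def continuous_upsample(feat, out_h, out_w, mlp):
    """Resample a (H, W, C) feature map to any (out_h, out_w) size.

    Assumed design: query = [nearest feature, offset to its cell center, cell size].
    """
    h, w, c = feat.shape
    # Target pixel centers in normalized [0, 1) coordinates.
    ys = (np.arange(out_h) + 0.5) / out_h
    xs = (np.arange(out_w) + 0.5) / out_w
    # Index of the nearest source cell for each target row and column.
    iy = np.minimum((ys * h).astype(int), h - 1)
    ix = np.minimum((xs * w).astype(int), w - 1)
    # Offset from the source cell center, measured in source-cell units.
    dy = ys * h - (iy + 0.5)
    dx = xs * w - (ix + 0.5)
    nearest = feat[iy[:, None], ix[None, :]]              # (out_h, out_w, C)
    offsets = np.stack(np.broadcast_arrays(dy[:, None], dx[None, :]), axis=-1)
    # Target cell size relative to a source cell; tells the MLP the scale.
    cell = np.broadcast_to(np.array([h / out_h, w / out_w]), (out_h, out_w, 2))
    query = np.concatenate([nearest, offsets, cell], axis=-1)
    return mlp(query.reshape(-1, c + 4)).reshape(out_h, out_w, -1)

# Usage: an 8x8 feature map resampled to a non-integer scale (1.5x by 2.5x).
feat = np.random.default_rng(1).normal(size=(8, 8, 4))
mlp = make_mlp(in_dim=4 + 4, hidden=16, out_dim=4)
out = continuous_upsample(feat, 12, 20, mlp)
print(out.shape)
```

Because the output size enters only through the query coordinates, the same weights serve any target resolution, which is the property the abstract attributes to continuous upsampling.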