Diffusion models have proven highly effective in image and video generation; however, because they are trained on single-scale data, they still face composition challenges when generating images of varying sizes. Adapting large pre-trained diffusion models to higher resolutions demands substantial computational and optimization resources, yet achieving generation capability comparable to that of low-resolution models remains elusive. This paper proposes a novel self-cascade diffusion model that leverages the rich knowledge gained from a well-trained low-resolution model for rapid adaptation to higher-resolution image and video generation, using either a tuning-free paradigm or inexpensive upsampler tuning. By integrating a sequence of multi-scale upsampler modules, the self-cascade diffusion model can efficiently adapt to a higher resolution while preserving the original composition and generation capabilities. We further propose a pivot-guided noise re-schedule strategy to speed up the inference process and improve local structural details. Compared to full fine-tuning, our approach achieves a 5× training speed-up and requires only 0.002M additional tuning parameters. Extensive experiments demonstrate that our approach can quickly adapt to higher-resolution image and video synthesis by fine-tuning for just 10k steps, with virtually no additional inference time.