To accelerate sampling, diffusion models (DMs) are often distilled into generators that map noise directly to data in a single step. In this approach, the resolution of the generator is fundamentally limited by that of the teacher DM. To overcome this limitation, we propose Progressive Growing of Diffusion Autoencoder (PaGoDA), a technique that progressively grows the resolution of the generator beyond that of the original teacher DM. Our key insight is that a pre-trained, low-resolution DM can deterministically encode high-resolution data into a structured latent space by solving the PF-ODE forward in time (data-to-noise), starting from an appropriately down-sampled image. Using this frozen encoder in an auto-encoder framework, we train a decoder by progressively growing its resolution. Because the decoder grows progressively, PaGoDA avoids re-training the teacher or student models when the student is upsampled, which makes the whole training pipeline much cheaper. In experiments, we use the progressively growing decoder to upsample from the pre-trained model's 64x64 resolution and generate 512x512 samples, achieving 2x faster inference than single-step distilled Stable Diffusion models such as LCM. PaGoDA also achieves state-of-the-art FIDs on ImageNet at all resolutions from 64x64 to 512x512. Additionally, we demonstrate PaGoDA's effectiveness in solving inverse problems and enabling controllable generation.
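The encoding step described in the abstract (down-sample a high-resolution image, then solve the PF-ODE forward in time with a frozen low-resolution DM) can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function names `downsample`, `toy_score`, and `encode` are hypothetical, average pooling stands in for "appropriate" down-sampling, the noise schedule is taken to be sigma(t) = t, the solver is plain Euler, and the analytic score of a Gaussian replaces a pre-trained DM.

```python
import numpy as np


def downsample(image, factor):
    """Average-pool a 2-D image by an integer factor (illustrative choice)."""
    h, w = image.shape
    assert h % factor == 0 and w % factor == 0
    return image.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))


def toy_score(x, t, data_std=1.0):
    """Analytic score of N(0, (data_std^2 + t^2) I).

    Hypothetical stand-in for the score of a pre-trained low-resolution DM.
    """
    return -x / (data_std**2 + t**2)


def encode(image_hr, factor, t_max=80.0, n_steps=2000, score_fn=toy_score):
    """Deterministically map a high-resolution image to a low-resolution latent.

    Integrates the PF-ODE dx/dt = -t * score(x, t) (assumed schedule
    sigma(t) = t) from t = 0 to t = t_max with explicit Euler steps.
    """
    x = downsample(image_hr, factor)
    ts = np.linspace(0.0, t_max, n_steps + 1)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        x = x + (t1 - t0) * (-t0 * score_fn(x, t0))
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.standard_normal((8, 8))   # toy "high-resolution" image
    latent = encode(image, factor=2)      # latent at the low resolution
    print(latent.shape)
```

Because the encoder involves no sampling noise, the same image always maps to the same latent, which is what lets a decoder be trained against it with a reconstruction objective. In this toy setting the ODE is linear, so the latent is simply the down-sampled image scaled by roughly sqrt(1 + t_max^2).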