With the incorporation of the UNet architecture, diffusion probabilistic models have become a dominant force in image generation. A key design element of UNet is the skip connections between encoder and decoder blocks. Although skip connections have been shown to improve training stability and model performance, we reveal that such shortcuts can limit the complexity of the transformation the network can express. As the number of sampling steps decreases, the generation process, and hence the role of the UNet, approaches a push-forward transformation from the Gaussian distribution to the target, placing greater demands on the network's capacity. To address this challenge, we propose Skip-Tuning, a simple yet surprisingly effective training-free tuning method applied to the skip connections. Our method achieves a 100% FID improvement for pretrained EDM on ImageNet 64 with only 19 NFEs (1.75), breaking the limit of ODE samplers regardless of the number of sampling steps. Surprisingly, the improvement persists as the number of sampling steps increases, and can even surpass the best EDM-2 result (1.58) with only 39 NFEs (1.57). We conduct comprehensive exploratory experiments to shed light on this surprising effectiveness. We observe that while Skip-Tuning increases the score-matching loss in pixel space, it reduces the loss in feature space, particularly at intermediate noise levels, which coincide with the noise range most responsible for the image-quality improvement.
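To make the idea concrete, below is a minimal PyTorch sketch of where Skip-Tuning acts. The abstract does not spell out the form of the tuning, so the sketch assumes it is a scalar rescaling `rho` of each skip feature before decoder-side concatenation; `TinySkipTunedUNet`, its layer sizes, and the value `rho = 0.8` are illustrative placeholders, not EDM's architecture or the paper's tuned coefficients.

```python
import torch
import torch.nn as nn

class TinySkipTunedUNet(nn.Module):
    """A toy one-level UNet showing where Skip-Tuning would intervene.

    `rho` rescales the encoder feature carried by the skip connection;
    rho = 1.0 recovers the vanilla UNet. All sizes are illustrative.
    """

    def __init__(self, rho: float = 1.0):
        super().__init__()
        self.rho = rho
        self.enc = nn.Conv2d(3, 32, 3, padding=1)
        self.mid = nn.Conv2d(32, 32, 3, padding=1)
        self.dec = nn.Conv2d(64, 3, 3, padding=1)  # 64 = 32 (upsampled) + 32 (skip)
        self.down = nn.AvgPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")

    def forward(self, x):
        h = torch.relu(self.enc(x))              # encoder feature, saved for the skip
        m = torch.relu(self.mid(self.down(h)))   # bottleneck at half resolution
        u = self.up(m)
        # Skip-Tuning: scale the skip feature before concatenation.
        return self.dec(torch.cat([u, self.rho * h], dim=1))

# Same weights, evaluated with and without a tuned skip scale.
x = torch.randn(1, 3, 64, 64)
vanilla = TinySkipTunedUNet(rho=1.0)
tuned = TinySkipTunedUNet(rho=0.8)               # hypothetical value
tuned.load_state_dict(vanilla.state_dict())     # no retraining: weights identical
print(vanilla(x).shape, tuned(x).shape)          # both: torch.Size([1, 3, 64, 64])
```

Because only `rho` changes, the pretrained weights are untouched: the same checkpoint can be evaluated under different skip scales, and a small search over `rho` guided by sample quality (e.g. FID) would suffice to pick them, which is what makes such a tuning training-free.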