Diffusion models achieve superior generation quality but suffer from slow generation speed due to the iterative nature of denoising. In contrast, consistency models, a new family of generative models, achieve competitive performance with significantly faster sampling. These models are trained either through consistency distillation, which leverages pretrained diffusion models, or through consistency training/tuning directly on raw data. In this work, we propose a novel framework for understanding consistency models: we model the denoising process of the diffusion model as a Markov Decision Process (MDP) and frame consistency model training as value estimation through Temporal Difference~(TD) learning. More importantly, this framework allows us to analyze the limitations of current consistency training/tuning strategies. Building upon Easy Consistency Tuning (ECT), we propose Stable Consistency Tuning (SCT), which incorporates variance-reduced learning using the score identity. SCT yields significant performance improvements on benchmarks such as CIFAR-10 and ImageNet-64. On ImageNet-64, SCT achieves a 1-step FID of 2.42 and a 2-step FID of 1.55, a new SoTA for consistency models.
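To make the TD analogy concrete, below is a minimal sketch of a bootstrapped consistency objective, assuming an EDM-style forward process x_t = x_0 + t·ε in pixel space and a stop-gradient target network; the names `consistency_td_loss`, `model`, and `target_model` are illustrative assumptions, not the paper's implementation.

```python
import torch

def consistency_td_loss(model, target_model, x0, t, s, eps):
    """TD-style consistency update (illustrative sketch, not the paper's exact
    algorithm): pull the online prediction at noise level t toward a
    bootstrapped, stop-gradient target at a smaller level s < t on the same
    noising trajectory -- analogous to a one-step TD value backup."""
    t_ = t.view(-1, 1, 1, 1)
    s_ = s.view(-1, 1, 1, 1)
    x_t = x0 + t_ * eps                 # noised sample at level t (EDM-style)
    x_s = x0 + s_ * eps                 # same trajectory at the smaller level s
    pred = model(x_t, t)                # online network's estimate of x0
    with torch.no_grad():
        target = target_model(x_s, s)   # bootstrapped target, as in TD learning
    return ((pred - target) ** 2).mean()
```

In this reading, letting s shrink to 0 makes the target collapse to the data itself, a high-variance Monte Carlo target; the variance-reduced, score-identity-based target that SCT introduces would replace that single-sample target with a better estimate of the posterior mean, though the details are beyond this sketch.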