The consistency model (CM) has recently made significant progress in accelerating the generation of diffusion models. However, its application to high-resolution, text-conditioned image generation in the latent space (the latent consistency model, LCM) remains unsatisfactory. In this paper, we identify three key flaws in the current design of LCM. We investigate the reasons behind these limitations and propose the Phased Consistency Model (PCM), which generalizes the design space and addresses all identified limitations. Our evaluations demonstrate that PCM significantly outperforms LCM across 1--16 step generation settings. Although PCM is designed specifically for multi-step refinement, its 1-step generation results are comparable or even superior to those of previous state-of-the-art methods designed specifically for 1-step generation. Furthermore, we show that PCM's methodology is versatile and applicable to video generation, enabling us to train a state-of-the-art few-step text-to-video generator. More details are available at https://g-u-n.github.io/projects/pcm/.