Although continuous-time consistency models (e.g., sCM, MeanFlow) are theoretically principled and empirically powerful for accelerating academic-scale diffusion models, their applicability to large-scale text-to-image and video tasks remains unclear due to infrastructure challenges in Jacobian-vector product (JVP) computation and the limitations of evaluation benchmarks such as FID. This work represents the first effort to scale continuous-time consistency to general application-level image and video diffusion models and to make JVP-based distillation effective at large scale. We first develop a parallelism-compatible FlashAttention-2 JVP kernel, enabling sCM training on models with over 10 billion parameters and on high-dimensional video tasks. Our investigation reveals fundamental quality limitations of sCM in fine-detail generation, which we attribute to error accumulation and the "mode-covering" nature of its forward-divergence objective. To remedy this, we propose the score-regularized continuous-time consistency model (rCM), which incorporates score distillation as a long-skip regularizer. This integration complements sCM with the "mode-seeking" reverse divergence, effectively improving visual quality while maintaining high generation diversity. Validated on large-scale models (Cosmos-Predict2, Wan2.1) with up to 14B parameters and 5-second videos, rCM generally matches the state-of-the-art distillation method DMD2 on quality metrics while mitigating mode collapse and offering notable advantages in diversity, all without GAN tuning or extensive hyperparameter searches. The distilled models generate high-fidelity samples in only $1\sim4$ steps, accelerating diffusion sampling by $15\times\sim50\times$. These results position rCM as a practical and theoretically grounded framework for advancing large-scale diffusion distillation. Code is available at https://github.com/NVlabs/rcm.
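The JVP at the heart of the infrastructure challenge above is the directional derivative $J_f(x)\,v$, computed in a single forward pass rather than by materializing the Jacobian. The sketch below illustrates the operation with a toy dual-number implementation in plain Python; the names `Dual` and `jvp` are purely illustrative, and this is a sketch of the mathematical operation only, not the paper's FlashAttention-2 kernel.

```python
# Minimal forward-mode autodiff via dual numbers, illustrating the
# Jacobian-vector product (JVP) used in continuous-time consistency
# training. Toy sketch only -- not the paper's FlashAttention-2 kernel.

class Dual:
    """A dual number a + b*eps with eps**2 = 0.

    Propagating (value, tangent) pairs through f computes the JVP:
    f(Dual(x, v)).tangent equals f'(x) * v for scalar functions.
    """
    def __init__(self, value, tangent=0.0):
        self.value = value
        self.tangent = tangent

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.tangent + other.tangent)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (a + b*eps)(c + d*eps) = ac + (ad + bc)*eps
        return Dual(self.value * other.value,
                    self.value * other.tangent + self.tangent * other.value)

    __rmul__ = __mul__


def jvp(f, x, v):
    """Value of f at x and its directional derivative along v."""
    out = f(Dual(x, v))
    return out.value, out.tangent


if __name__ == "__main__":
    f = lambda x: x * x + 3 * x          # f'(x) = 2x + 3
    value, tangent = jvp(f, 2.0, 1.0)
    print(value, tangent)                # 10.0 7.0
```

In practice, frameworks expose this directly (e.g., forward-mode autodiff in PyTorch or JAX); the engineering difficulty the abstract refers to is making such forward-mode computation efficient inside fused attention kernels at multi-billion-parameter scale.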