Denoising generative models, such as diffusion and flow-based models, produce high-quality samples but require many denoising steps due to discretization error. Flow maps, which estimate the average velocity between timesteps, mitigate this error and enable faster sampling. However, their training typically demands architectural changes that limit compatibility with pretrained flow models. We introduce Decoupled MeanFlow, a simple decoding strategy that converts flow models into flow map models without architectural modifications. Our method conditions the final blocks of diffusion transformers on the subsequent timestep, allowing pretrained flow models to be directly repurposed as flow maps. Combined with enhanced training techniques, this design enables high-quality generation in as few as 1 to 4 steps. Notably, we find that training flow models and subsequently converting them is more efficient and effective than training flow maps from scratch. On ImageNet 256x256 and 512x512, our models attain 1-step FID of 2.16 and 2.12, respectively, surpassing prior art by a large margin. Furthermore, we achieve FID of 1.51 and 1.68 when increasing the steps to 4, which nearly matches the performance of flow models while delivering over 100x faster inference.
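For reference, the flow-map quantity described above can be written down explicitly. Under a MeanFlow-style formulation (the exact sign and time conventions here are an assumption, not taken from the abstract), the average velocity between a current timestep $t$ and a target timestep $r < t$ is

\[
u(x_t, r, t) \;=\; \frac{1}{t - r} \int_{r}^{t} v(x_\tau, \tau)\, d\tau ,
\]

where $v$ is the instantaneous velocity field of the underlying flow model. A single sampling step then jumps directly from $t$ to $r$ via

\[
x_r \;=\; x_t - (t - r)\, u(x_t, r, t),
\]

which recovers the ordinary flow model's velocity in the limit $r \to t$.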
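A minimal sketch of the decoding strategy is given below, assuming a simplified DiT-style backbone: the encoder blocks are conditioned only on the current timestep t (as in the pretrained flow model), while the final blocks are additionally conditioned on the subsequent timestep r, so the network can be read out as an average velocity u(x_t, t, r). All module names, the block layout, and the sampling time convention are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of decoupled decoding for a flow-map model.
# Encoder blocks see only t (matching a pretrained flow model); the final
# "decoder" blocks are additionally conditioned on the target timestep r.

import torch
import torch.nn as nn


class Block(nn.Module):
    """One transformer block modulated by a conditioning embedding (AdaLN-style)."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.mod = nn.Linear(dim, 2 * dim)  # scale/shift from the conditioning vector

    def forward(self, x, cond):
        scale, shift = self.mod(cond).unsqueeze(1).chunk(2, dim=-1)
        h = self.norm1(x) * (1 + scale) + shift
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x


def timestep_embed(t, dim):
    """Sinusoidal embedding of a scalar timestep in [0, 1]."""
    half = dim // 2
    freqs = torch.exp(-torch.linspace(0, 10, half, device=t.device))
    args = t[:, None] * freqs[None, :] * 1000.0
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)


class DecoupledFlowMap(nn.Module):
    """Encoder blocks condition on t only; the last `num_decoder` blocks also see r."""

    def __init__(self, dim: int = 256, depth: int = 12, num_decoder: int = 4):
        super().__init__()
        self.encoder = nn.ModuleList(Block(dim) for _ in range(depth - num_decoder))
        self.decoder = nn.ModuleList(Block(dim) for _ in range(num_decoder))
        self.t_embed = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.r_embed = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.out = nn.Linear(dim, dim)  # a real DiT would unpatchify here
        self.dim = dim

    def forward(self, x_tokens, t, r):
        # x_tokens: (B, N, dim) patch tokens of the noisy input x_t.
        c_t = self.t_embed(timestep_embed(t, self.dim))
        c_r = self.r_embed(timestep_embed(r, self.dim))
        h = x_tokens
        for blk in self.encoder:      # pretrained flow-model blocks: t conditioning only
            h = blk(h, c_t)
        for blk in self.decoder:      # decoding blocks: additionally conditioned on r
            h = blk(h, c_t + c_r)
        return self.out(h)            # predicted average velocity u(x_t, t, r)


@torch.no_grad()
def sample(model, x, steps=4):
    # Few-step sampling sketch; the t=1 (noise) -> t=0 (data) convention is an assumption.
    ts = torch.linspace(1.0, 0.0, steps + 1, device=x.device)
    for i in range(steps):
        t = ts[i].expand(x.shape[0])
        r = ts[i + 1].expand(x.shape[0])
        u = model(x, t, r)
        x = x - (t - r).view(-1, 1, 1) * u  # jump from t to r using the average velocity
    return x
```

Because only the conditioning of the final blocks changes in this sketch, the encoder weights of a pretrained flow model could be loaded unchanged, which is what makes such a conversion compatible with existing checkpoints.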