Recent advances in large multi-modal generative models have demonstrated impressive capabilities in multi-modal generation, including image and video generation. These models are typically built upon multi-step frameworks such as diffusion and flow matching, which inherently limits their inference efficiency (requiring 40-100 function evaluations, i.e., NFEs). While various few-step methods aim to accelerate inference, existing solutions have clear limitations. Prominent distillation-based methods, such as progressive and consistency distillation, either require an iterative distillation procedure or degrade significantly at very few steps (< 4-NFE). Meanwhile, integrating adversarial training into distillation (e.g., DMD/DMD2 and SANA-Sprint) to enhance performance introduces training instability, added complexity, and high GPU memory overhead due to the auxiliary trained models. To this end, we propose TwinFlow, a simple yet effective framework for training 1-step generative models that bypasses the need for fixed pretrained teacher models and avoids standard adversarial networks during training, making it well suited to building large-scale, efficient models. On text-to-image tasks, our method achieves a GenEval score of 0.83 at 1-NFE, outperforming strong baselines such as SANA-Sprint (a GAN-loss-based framework) and RCGM (a consistency-based framework). Notably, we demonstrate the scalability of TwinFlow by full-parameter training on Qwen-Image-20B, transforming it into an efficient few-step generator. With just 1-NFE, our approach matches the performance of the original 100-NFE model on both the GenEval and DPG-Bench benchmarks, reducing computational cost by $100\times$ with only minor quality degradation. Project page: https://zhenglin-cheng.com/twinflow.