Recent advances in large multi-modal generative models have demonstrated impressive capabilities in multi-modal generation, including image and video generation. These models are typically built upon multi-step frameworks such as diffusion and flow matching, which inherently limits their inference efficiency (requiring 40-100 function evaluations, i.e., NFEs). While various few-step methods aim to accelerate inference, existing solutions have clear limitations. Prominent distillation-based methods, such as progressive and consistency distillation, either require an iterative distillation procedure or degrade significantly at very few steps (< 4-NFE). Meanwhile, integrating adversarial training into distillation (e.g., DMD/DMD2 and SANA-Sprint) to enhance performance introduces training instability, added complexity, and high GPU memory overhead due to the auxiliary trained models. To this end, we propose TwinFlow, a simple yet effective framework for training 1-step generative models that bypasses the need for fixed pretrained teacher models and avoids standard adversarial networks during training, making it well suited to building large-scale, efficient models. On text-to-image tasks, our method achieves a GenEval score of 0.83 at 1-NFE, outperforming strong baselines such as SANA-Sprint (a GAN-loss-based framework) and RCGM (a consistency-based framework). Notably, we demonstrate the scalability of TwinFlow by full-parameter training on Qwen-Image-20B, transforming it into an efficient few-step generator. With just 1-NFE, our approach matches the performance of the original 100-NFE model on both the GenEval and DPG-Bench benchmarks, reducing computational cost by $100\times$ with only minor quality degradation. Project page: https://zhenglin-cheng.com/twinflow.