We introduce YOSO, a novel generative model designed for rapid, scalable, and high-fidelity one-step image synthesis. YOSO integrates the diffusion process with GANs to achieve the best of both worlds. Specifically, we smooth the distribution using the denoising generator itself, performing self-cooperative learning. We show that our method can serve as a one-step generation model trained from scratch with competitive performance. Moreover, we show that it can be extended to fine-tune pre-trained text-to-image diffusion models for high-quality one-step text-to-image synthesis, even with LoRA fine-tuning. In particular, we provide the first diffusion transformer that can generate images in one step; it is trained at 512 resolution and can adapt to 1024 resolution without extra explicit training. Our code is available at https://github.com/Luo-Yihong/YOSO