We introduce YOSO, a novel generative model designed for rapid, scalable, and high-fidelity one-step image synthesis. YOSO integrates the diffusion process with GANs to achieve the best of both worlds. Specifically, we smooth the distribution with the denoising generator itself, performing self-cooperative learning. We show that our method can serve as a one-step generation model trained from scratch with competitive performance. Moreover, we show that our method can be extended to fine-tune pre-trained text-to-image diffusion models for high-quality one-step text-to-image synthesis, even with LoRA fine-tuning. In particular, we provide the first diffusion transformer that can generate images in one step, trained at 512 resolution, with the capability of adapting to 1024 resolution without extra explicit training. Our code is available at https://github.com/Luo-Yihong/YOSO.