We introduce YOSO, a novel generative model designed for rapid, scalable, and high-fidelity one-step image synthesis. It achieves this by integrating the diffusion process with GANs. Specifically, we smooth the distribution using the denoising generator itself, which amounts to self-cooperative learning. We show that our method can be trained from scratch as a one-step generation model with competitive performance. We further show that it extends to fine-tuning pre-trained text-to-image diffusion models for high-quality one-step text-to-image synthesis, even with LoRA fine-tuning. In particular, we provide the first diffusion transformer that generates images in one step; it is trained at 512 resolution and can adapt to 1024 resolution without explicit training. Our code is available at https://github.com/Luo-Yihong/YOSO.