Stable Diffusion is the mainstay of text-to-image (T2I) synthesis in the community due to its generation performance and open-source nature. Recently, Stable Diffusion XL (SDXL), the successor of Stable Diffusion, has received a lot of attention due to its significant performance improvements, achieved with a higher resolution of 1024x1024 and a larger model. However, its increased computation cost and model size require higher-end hardware (e.g., GPUs with more VRAM) for end users, incurring higher operating costs. To address this problem, we propose an efficient latent diffusion model for text-to-image synthesis obtained by distilling the knowledge of SDXL. To this end, we first perform an in-depth analysis of the denoising U-Net in SDXL, which is the main bottleneck of the model, and then design a more efficient U-Net based on the analysis. Second, we explore how to effectively distill the generation capability of SDXL into the efficient U-Net and identify four essential factors, the central one being that self-attention is the most important part. With our efficient U-Net and self-attention-based knowledge distillation strategy, we build efficient T2I models, called KOALA-1B and KOALA-700M, which reduce the model size by 54% and 69%, respectively, relative to the original SDXL model. In particular, KOALA-700M is more than twice as fast as SDXL while still retaining decent generation quality. We hope that, owing to their balanced speed-performance tradeoff, our KOALA models can serve as a cost-effective alternative to SDXL in resource-constrained environments.
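As a rough illustration of feature-level knowledge distillation of the kind described above, the following minimal sketch combines an output-matching term with a term that matches intermediate (e.g., self-attention) features between a teacher and a student. It is not the paper's loss: the function names, the `feat_weight` parameter, the use of mean squared error, and the representation of features as flat lists of floats are all assumptions made for illustration.

```python
from typing import Sequence

Vector = Sequence[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(
    teacher_out: Vector,
    student_out: Vector,
    teacher_feats: Sequence[Vector],
    student_feats: Sequence[Vector],
    feat_weight: float = 1.0,  # hypothetical weighting, not taken from the paper
) -> float:
    """Output-level loss plus a weighted sum of per-layer feature losses.

    teacher_feats / student_feats stand in for flattened self-attention
    features taken at matching positions in the two U-Nets (an assumption).
    """
    if len(teacher_feats) != len(student_feats):
        raise ValueError("teacher and student must expose the same number of features")
    out_loss = mse(teacher_out, student_out)
    feat_loss = sum(mse(t, s) for t, s in zip(teacher_feats, student_feats))
    return out_loss + feat_weight * feat_loss
```

In an actual training setup the vectors would be tensors from a deep learning framework and the teacher would be frozen; the structure of the objective, under these assumptions, stays the same.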