We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL uses a UNet backbone that is three times larger. The increase in model parameters is mainly due to more attention blocks and a larger cross-attention context, as SDXL uses a second text encoder. We design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. We also introduce a refinement model that improves the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique. We demonstrate that SDXL shows drastically improved performance compared to the previous versions of Stable Diffusion and achieves results competitive with those of black-box state-of-the-art image generators. In the spirit of promoting open research and fostering transparency in large model training and evaluation, we provide access to code and model weights at https://github.com/Stability-AI/generative-models.
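To illustrate how a second text encoder can enlarge the cross-attention context, the following is a minimal sketch, not the released implementation. It assumes that the per-token outputs of the two encoders are concatenated along the channel axis. The abstract gives no dimensions: the token count (77) and channel widths (768 and 1280) are assumptions chosen to match the commonly reported widths of CLIP ViT-L and OpenCLIP ViT-bigG, and the helper name `concat_context` is hypothetical.

```python
from typing import List

# A per-token embedding matrix, indexed as [token][channel].
Matrix = List[List[float]]


def concat_context(enc_a: Matrix, enc_b: Matrix) -> Matrix:
    """Concatenate two per-token embedding matrices along the channel axis.

    Hypothetical helper for illustration only.
    """
    if len(enc_a) != len(enc_b):
        raise ValueError("both encoders must emit the same number of tokens")
    # Joining each pair of rows widens the channel dimension to dim_a + dim_b.
    return [row_a + row_b for row_a, row_b in zip(enc_a, enc_b)]


# Assumed shapes, not taken from the abstract.
TOKENS = 77   # assumed prompt length in tokens
DIM_A = 768   # assumed width of the first text encoder
DIM_B = 1280  # assumed width of the second text encoder

# Placeholder embeddings standing in for real encoder outputs.
enc_a = [[0.0] * DIM_A for _ in range(TOKENS)]
enc_b = [[1.0] * DIM_B for _ in range(TOKENS)]

context = concat_context(enc_a, enc_b)
print(len(context), len(context[0]))
```

Under these assumptions the cross-attention layers attend over the same number of tokens as before, but each token carries a wider feature vector, which is one way the context seen by the UNet can grow without lengthening the prompt.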