Diffusion models have shown remarkable capacity in image synthesis, typically built on a U-shaped architecture with convolutional neural network (CNN) blocks. However, the locality of the convolution operation may limit the model's ability to capture long-range semantic information. To address this issue, we propose Yuan-TecSwin, a text-conditioned diffusion model based on the Swin Transformer. Swin-Transformer blocks replace the CNN blocks in the encoder and decoder, improving non-local modeling in feature extraction and image restoration. Text-image alignment is improved through a well-chosen text encoder, effective use of the text embeddings, and careful design of how the text condition is incorporated. By using an adapted time step to search across different diffusion stages, inference performance is further improved by 10%. Yuan-TecSwin achieves a state-of-the-art FID of 1.37 on the ImageNet generation benchmark without any additional models at different denoising stages. In a side-by-side comparison, human interviewees find it difficult to distinguish model-generated images from human-painted ones.
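To make the architectural idea concrete, the sketch below illustrates, under stated assumptions, how a convolutional block in a diffusion U-Net stage could be replaced by a Swin-style windowed self-attention block with text conditioning injected via cross-attention. This is a minimal illustration, not the paper's implementation; all module names, hyperparameters, and the specific conditioning scheme (cross-attention over CLIP-like token embeddings) are assumptions for exposition.

```python
# Hypothetical sketch: a Swin-style block with text cross-attention that could stand in
# for a CNN block inside one stage of a diffusion U-Net. Names and sizes are illustrative.
import torch
import torch.nn as nn


def window_partition(x, ws):
    """Split a (B, H, W, C) feature map into non-overlapping ws x ws windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)  # (B * num_windows, ws*ws, C)


def window_reverse(windows, ws, H, W):
    """Inverse of window_partition, back to (B, H, W, C)."""
    B = windows.shape[0] // ((H // ws) * (W // ws))
    x = windows.view(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)


class SwinTextBlock(nn.Module):
    """Windowed self-attention + cross-attention to text tokens + MLP (assumed design)."""

    def __init__(self, dim, num_heads=4, window_size=8, text_dim=512):
        super().__init__()
        self.ws = window_size
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, kdim=text_dim,
                                                vdim=text_dim, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, text_emb):
        # x: (B, C, H, W) U-Net feature map; text_emb: (B, L, text_dim) text-encoder tokens.
        B, C, H, W = x.shape
        h = x.permute(0, 2, 3, 1)                        # (B, H, W, C)

        # Local modeling: self-attention restricted to non-overlapping windows, as in Swin.
        win = window_partition(self.norm1(h), self.ws)   # (B * num_windows, ws*ws, C)
        attn, _ = self.self_attn(win, win, win)
        h = h + window_reverse(attn, self.ws, H, W)

        # Text conditioning: every spatial token attends to the text embeddings.
        tokens = h.reshape(B, H * W, C)
        cross, _ = self.cross_attn(self.norm2(tokens), text_emb, text_emb)
        tokens = tokens + cross

        tokens = tokens + self.mlp(self.norm3(tokens))
        return tokens.reshape(B, H, W, C).permute(0, 3, 1, 2)  # back to (B, C, H, W)


if __name__ == "__main__":
    block = SwinTextBlock(dim=64, window_size=8, text_dim=512)
    feats = torch.randn(2, 64, 32, 32)    # one stage's feature map
    text = torch.randn(2, 77, 512)        # e.g. token embeddings from a CLIP-like text encoder
    print(block(feats, text).shape)       # torch.Size([2, 64, 32, 32])
```

In this reading, the windowed self-attention supplies local spatial modeling at roughly convolutional cost, while cross-attention to the text tokens provides the global, text-conditioned pathway that a purely convolutional block lacks; how the actual model arranges these components is described in the paper body.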