Diffusion models have made impressive progress in text-to-image synthesis. However, training such large-scale models (e.g., Stable Diffusion) from scratch incurs high computational costs and requires massive numbers of high-quality text-image pairs, which makes it unaffordable for languages other than English. To address this challenge, we propose IAP, a simple but effective method for transferring the English Stable Diffusion model to Chinese. IAP optimizes only a separate Chinese text encoder, keeping all other parameters fixed, in order to align the Chinese semantic space with the English one in CLIP. To achieve this, we treat images as pivots and minimize the distance between the attentive features produced by cross-attention between images and each of the two languages. In this way, IAP efficiently connects Chinese, English, and visual semantics in CLIP's embedding space, improving the quality of images generated directly from Chinese prompts. Experimental results show that our method outperforms several strong Chinese diffusion models while using only 5% to 10% of the training data.
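The image-as-pivot objective described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that image patch features act as queries and text token features act as keys and values, that the attention is a plain scaled dot product with no learned projections, and that the distance is a mean squared error. The function names `cross_attention` and `pivot_alignment_loss`, the feature shapes, and the random inputs are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(image_feats, text_feats):
    """Image patches (queries) attend over text tokens (keys and values).

    Assumption: plain scaled dot-product attention without learned projections.
    image_feats: (num_patches, dim); text_feats: (num_tokens, dim).
    Returns attentive features of shape (num_patches, dim).
    """
    dim = image_feats.shape[-1]
    scores = image_feats @ text_feats.T / np.sqrt(dim)
    weights = softmax(scores, axis=-1)
    return weights @ text_feats


def pivot_alignment_loss(image_feats, en_text_feats, zh_text_feats):
    """Mean squared distance between the two sets of attentive features.

    Assumption: MSE stands in for whatever distance the paper actually uses.
    """
    en_attn = cross_attention(image_feats, en_text_feats)
    zh_attn = cross_attention(image_feats, zh_text_feats)
    return float(np.mean((zh_attn - en_attn) ** 2))


# Hypothetical stand-ins for encoder outputs: 4 image patches, 5 English
# tokens, 6 Chinese tokens, all in an 8-dimensional shared embedding space.
rng = np.random.default_rng(0)
image_feats = rng.normal(size=(4, 8))
en_text_feats = rng.normal(size=(5, 8))
zh_text_feats = rng.normal(size=(6, 8))

loss = pivot_alignment_loss(image_feats, en_text_feats, zh_text_feats)
print(loss)
```

Under these assumptions, the attentive features of both languages have the shape of the image features, so the two captions can be compared even when their token counts differ. In a training setup matching the abstract, only the Chinese text encoder that produces `zh_text_feats` would receive gradients from this loss.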