Large text-to-image (T2I) diffusion models have shown a remarkable ability to produce photorealistic and diverse images from text inputs. However, existing models accept input in only a few languages, e.g., English, Chinese, and Japanese, which leaves speakers of other languages underserved and hinders the global adoption of T2I models. This paper therefore presents AltDiffusion, a novel multilingual T2I diffusion model that supports eighteen languages. Specifically, we first train a multilingual text encoder via knowledge distillation. We then plug it into a pretrained English-only diffusion model and train the model on a large-scale multilingual dataset with a two-stage scheme, consisting of a concept alignment stage and a quality improvement stage, to enhance its multilingual capability. Furthermore, we introduce a new benchmark, comprising the Multilingual-General-18 (MG-18) and Multilingual-Cultural-18 (MC-18) datasets, to evaluate how well T2I diffusion models generate high-quality images and capture culture-specific concepts in different languages. Experimental results on both MG-18 and MC-18 demonstrate that AltDiffusion outperforms current state-of-the-art T2I models, e.g., Stable Diffusion, in multilingual understanding, especially with respect to culture-specific concepts, while retaining comparable capability for generating high-quality images. All source code and checkpoints can be found at https://github.com/superhero-7/AltDiffuson.
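As a rough illustration of the knowledge distillation step mentioned above, the sketch below assumes a common setup in which a frozen English teacher encoder embeds an English caption, a multilingual student encoder embeds a parallel caption in another language, and the student is trained to match the teacher's embedding under a mean-squared-error objective. The abstract does not state the loss or the encoders used, so the MSE choice and the function names `distillation_loss` and `batch_distillation_loss` are assumptions made purely for illustration.

```python
from typing import Sequence


def distillation_loss(teacher_vec: Sequence[float], student_vec: Sequence[float]) -> float:
    """Mean squared error between one teacher and one student sentence embedding.

    Assumption: both embeddings live in the same space and have equal dimension.
    """
    if len(teacher_vec) != len(student_vec):
        raise ValueError("teacher and student embeddings must have the same dimension")
    squared_errors = [(t - s) ** 2 for t, s in zip(teacher_vec, student_vec)]
    return sum(squared_errors) / len(squared_errors)


def batch_distillation_loss(
    teacher_batch: Sequence[Sequence[float]],
    student_batch: Sequence[Sequence[float]],
) -> float:
    """Average the per-pair loss over a batch of parallel captions.

    teacher_batch[i] is the (hypothetical) teacher embedding of an English caption;
    student_batch[i] is the student embedding of its translation.
    """
    if len(teacher_batch) != len(student_batch):
        raise ValueError("batches must contain the same number of caption pairs")
    losses = [distillation_loss(t, s) for t, s in zip(teacher_batch, student_batch)]
    return sum(losses) / len(losses)
```

In practice such a loss would be computed on tensors inside a training loop and minimized with respect to the student's parameters only, with the teacher kept frozen. The plain-Python version here is meant only to make the objective concrete.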