Text-to-image diffusion models are well known for their ability to generate realistic images from textual prompts. However, existing work has focused predominantly on English, leaving non-English text-to-image generation largely unsupported. Commonly used translation methods cannot resolve generation problems tied to language-specific culture, while training from scratch on a language-specific dataset is prohibitively expensive. In this paper, we propose a simple plug-and-play language transfer method based on knowledge distillation. All that is required is to train a lightweight, MLP-like parameter-efficient adapter (PEA) with only 6M parameters under teacher knowledge distillation, using a small parallel corpus. Surprisingly, even with the UNet parameters frozen, the model achieves remarkable performance on a language-specific prompt evaluation set, demonstrating that the PEA can unlock the latent generation ability of the original UNet. It also closely approaches the performance of the English text-to-image model on a general prompt evaluation set. Furthermore, our adapter can serve as a plugin, achieving significant results on downstream tasks in cross-lingual text-to-image generation. Code will be available at: https://github.com/OPPO-Mente-Lab/PEA-Diffusion
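The distillation setup described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dimensions, initialization, GELU activation, and the single last-layer SGD step are all assumed for demonstration. The idea is that an MLP-like adapter maps a non-English text encoder's embeddings toward the English teacher encoder's embedding space, trained with a distillation loss (here MSE) on parallel prompt pairs, while the UNet itself stays frozen:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 768, 1024, 768  # assumed embedding sizes, not from the paper

# MLP-like adapter: Linear -> GELU -> Linear (a stand-in for the PEA)
W1 = rng.normal(0, 0.02, (d_in, d_hid))
b1 = np.zeros(d_hid)
W2 = rng.normal(0, 0.02, (d_hid, d_out))
b2 = np.zeros(d_out)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def adapter(x):
    return gelu(x @ W1 + b1) @ W2 + b2

# A parallel batch: "student" stands for embeddings of non-English prompts,
# "teacher" for embeddings of the same prompts from the English text encoder.
# Random placeholders here; in training these come from real encoders.
student = rng.normal(size=(4, d_in))
teacher = rng.normal(size=(4, d_out))

def mse(a, b):
    return ((a - b) ** 2).mean()

loss_before = mse(adapter(student), teacher)

# One SGD step on the adapter's last layer only (the gradient of the MSE loss
# through a linear layer is easy to write by hand; a real run would update all
# adapter parameters with autograd while keeping the UNet frozen).
h = gelu(student @ W1 + b1)
err = adapter(student) - teacher
lr = 0.5
W2 -= lr * 2.0 * (h.T @ err) / err.size
b2 -= lr * 2.0 * err.sum(axis=0) / err.size

loss_after = mse(adapter(student), teacher)
```

After the step, `loss_after` is smaller than `loss_before`: the adapter's output has moved closer to the teacher's embedding space, which is the entire training signal the method needs.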