Research on text-to-image generation (TTI) still predominantly focuses on English because annotated image-caption data is scarce in other languages; in the long run, this might widen inequitable access to TTI technology. In this work, we therefore investigate multilingual TTI (termed mTTI) and the current potential of neural machine translation (NMT) to bootstrap mTTI systems. We provide two key contributions. 1) Relying on a multilingual multi-modal encoder, we provide a systematic empirical study of three standard cross-lingual NLP methods applied to mTTI: Translate Train, Translate Test, and Zero-Shot Transfer. 2) We propose Ensemble Adapter (EnsAd), a novel parameter-efficient approach that learns to weight and consolidate multilingual text knowledge within the mTTI framework, mitigating the language gap and thus improving mTTI performance. Our evaluations on the standard mTTI datasets COCO-CN, Multi30K Task2, and LAION-5B demonstrate the potential of translation-enhanced mTTI systems and validate the benefits of the proposed EnsAd, which yields consistent gains across all datasets. Further investigations of model variants, ablation studies, and qualitative analyses provide additional insights into the inner workings of the proposed mTTI approaches.