Text-to-image (T2I) models based on diffusion processes have achieved remarkable success in controllable image generation from user-provided captions. However, the tight coupling between the text encoder and the image decoder in current T2I models makes these components challenging to replace or upgrade. Such changes often require massive fine-tuning or even training from scratch, at prohibitive expense. To address this problem, we propose GlueGen, which applies a newly proposed GlueNet model to align features from single-modal or multi-modal encoders with the latent space of an existing T2I model. The approach introduces a new training objective that leverages parallel corpora to align the representation spaces of different encoders. Empirical results show that GlueNet can be trained efficiently and enables various capabilities beyond previous state-of-the-art models: 1) multilingual language models such as XLM-Roberta can be aligned with existing T2I models, allowing high-quality images to be generated from captions in languages other than English; 2) GlueNet can align multi-modal encoders such as AudioCLIP with the Stable Diffusion model, enabling sound-to-image generation; 3) it can also upgrade the current text encoder of the latent diffusion model to handle challenging generation cases. By aligning various feature representations, GlueNet allows flexible and efficient integration of new functionality into existing T2I models and sheds light on X-to-image (X2I) generation.