GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation

Text-to-image (T2I) models based on diffusion processes have achieved remarkable success in controllable image generation from user-provided captions. However, the tight coupling between the text encoder and the image decoder in current T2I models makes the encoder challenging to replace or upgrade. Such changes often require massive fine-tuning or even training from scratch, at prohibitive expense. To address this problem, we propose GlueGen, which applies a newly proposed GlueNet model to align features from single-modal or multi-modal encoders with the latent space of an existing T2I model. The approach introduces a new training objective that leverages parallel corpora to align the representation spaces of different encoders. Empirical results show that GlueNet can be trained efficiently and enables various capabilities beyond previous state-of-the-art models: 1) multilingual language models such as XLM-Roberta can be aligned with existing T2I models, allowing high-quality images to be generated from non-English captions; 2) GlueNet can align multi-modal encoders such as AudioCLIP with the Stable Diffusion model, enabling sound-to-image generation; 3) it can also upgrade the current text encoder of the latent diffusion model to handle challenging generation cases. By aligning various feature representations, GlueNet allows flexible and efficient integration of new functionality into existing T2I models and sheds light on X-to-image (X2I) generation.
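
The abstract does not spell out GlueNet's architecture or its training objective, so the following is only a toy sketch of the general idea of aligning one encoder's feature space to another's using paired examples. Everything in it is an assumption made for illustration: the function names (`make_parallel_pairs`, `train_aligner`, `alignment_loss`), the plain linear map, the mean-squared-error loss, and the synthetic "parallel corpus" are hypothetical and are not taken from the paper.

```python
import random

def apply_map(weights, x):
    """Project a feature vector x through a linear map (list of rows)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def make_parallel_pairs(n, true_map, rng):
    """Toy stand-in for a parallel corpus.

    Each pair is (feature from the new encoder, feature from the encoder
    the existing T2I model expects). Here the second is a fixed linear
    function of the first, which is an assumption of this sketch only.
    """
    in_dim = len(true_map[0])
    pairs = []
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0) for _ in range(in_dim)]
        pairs.append((x, apply_map(true_map, x)))
    return pairs

def alignment_loss(weights, pairs):
    """Mean squared error between mapped source features and targets."""
    total = 0.0
    for x, y in pairs:
        pred = apply_map(weights, x)
        total += sum((p - t) ** 2 for p, t in zip(pred, y))
    return total / len(pairs)

def train_aligner(pairs, in_dim, out_dim, lr=0.1, steps=500):
    """Fit the linear map with full-batch gradient descent on the MSE loss."""
    weights = [[0.0] * in_dim for _ in range(out_dim)]
    n = len(pairs)
    for _ in range(steps):
        grads = [[0.0] * in_dim for _ in range(out_dim)]
        for x, y in pairs:
            pred = apply_map(weights, x)
            for i in range(out_dim):
                err = pred[i] - y[i]
                for j in range(in_dim):
                    grads[i][j] += 2.0 * err * x[j] / n
        for i in range(out_dim):
            for j in range(in_dim):
                weights[i][j] -= lr * grads[i][j]
    return weights

# Hypothetical ground-truth relation between the two feature spaces.
true_map = [[0.5, -1.0, 0.25], [1.5, 0.0, -0.75]]
rng = random.Random(0)
pairs = make_parallel_pairs(64, true_map, rng)

initial_loss = alignment_loss([[0.0] * 3 for _ in range(2)], pairs)
learned = train_aligner(pairs, in_dim=3, out_dim=2)
final_loss = alignment_loss(learned, pairs)
```

In this sketch only the small aligner is trained; the two "encoders" are never updated, which mirrors the abstract's point that the existing T2I model can be reused rather than fine-tuned or retrained. A real system would presumably replace the linear map and synthetic pairs with a learned network and actual paired data.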
