Multi-Concept T2I-Zero: Tweaking Only The Text Embeddings and Nothing Else

Recent advances in text-to-image diffusion models have enabled the photorealistic generation of images from text prompts. Despite this progress, existing models still struggle to generate compositional multi-concept images naturally, limiting their ability to visualize human imagination. While several recent works have attempted to address this issue, they either introduce additional training or adopt guidance at inference time. In this work, we consider a more ambitious goal: natural multi-concept generation using a pre-trained diffusion model, with almost no extra cost. To achieve this goal, we identify limitations in the text embeddings used by pre-trained text-to-image diffusion models. Specifically, we observe concept dominance and non-localized contribution, both of which severely degrade multi-concept generation performance. We further design a minimal low-cost solution that overcomes these issues by tweaking (not re-training) the text embeddings for more realistic multi-concept text-to-image generation. Our Correction by Similarities method tweaks the embeddings of concepts by collecting semantic features from the most similar tokens to localize the contribution. To avoid mixing features of concepts, we also apply Cross-Token Non-Maximum Suppression, which excludes the overlap of contributions from different concepts. Experiments show that our approach outperforms previous methods in text-to-image, image manipulation, and personalization tasks, despite not introducing additional training or inference costs to the diffusion steps.
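
The abstract gives no formulas, so the following is only a rough sketch of how the two ideas could fit together, not the authors' implementation. It assumes that cosine similarity is the similarity measure, that each concept embedding is replaced by a similarity-weighted average of its `top_k` most similar token embeddings, and that Cross-Token Non-Maximum Suppression means each token contributes only to the concept it is most similar to. The function names, the `top_k` parameter, and the normalization step are all hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of `embeddings`."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    return unit @ unit.T


def cross_token_nms(weights):
    """Assumed form of Cross-Token NMS: for every token (column), keep only
    the weight of the concept (row) with the largest value and zero the rest,
    so that no token contributes to more than one concept."""
    winners = weights.argmax(axis=0)
    mask = np.zeros_like(weights, dtype=bool)
    mask[winners, np.arange(weights.shape[1])] = True
    return np.where(mask, weights, 0.0)


def correct_by_similarities(embeddings, concept_indices, top_k=3):
    """Assumed form of Correction by Similarities: rebuild each concept
    embedding as a similarity-weighted average of its most similar tokens.

    embeddings:      (n_tokens, dim) array of text-encoder outputs
    concept_indices: positions of the concept tokens in the prompt
    top_k:           hypothetical number of similar tokens to collect
    """
    sims = cosine_similarity_matrix(embeddings)[concept_indices]

    # Keep only the top_k most similar tokens per concept (non-negative weights).
    weights = np.zeros_like(sims)
    for row, row_sims in enumerate(sims):
        top = np.argsort(row_sims)[-top_k:]
        weights[row, top] = np.clip(row_sims[top], 0.0, None)

    # Remove overlapping contributions between concepts, then normalize.
    weights = cross_token_nms(weights)
    weights = weights / np.clip(weights.sum(axis=1, keepdims=True), 1e-12, None)

    corrected = embeddings.copy()
    corrected[concept_indices] = weights @ embeddings
    return corrected


# Toy usage with random vectors standing in for real text embeddings.
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(6, 8))
toy_corrected = correct_by_similarities(toy_embeddings, concept_indices=[1, 4])
```

In this sketch only the text embeddings are modified before they are handed to the diffusion model, which matches the abstract's claim that no cost is added to the diffusion steps themselves.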

