Recent advances in text-to-image diffusion models have enabled photorealistic image generation from text prompts. Despite this progress, existing models still struggle to generate compositional multi-concept images naturally, which limits their ability to visualize human imagination. Several recent works have attempted to address this issue, but they either introduce additional training or adopt guidance at inference time. In this work, we consider a more ambitious goal: natural multi-concept generation using a pre-trained diffusion model, with almost no extra cost. To achieve this goal, we identify limitations in the text embeddings used by pre-trained text-to-image diffusion models. Specifically, we observe concept dominance and non-localized contribution, both of which severely degrade multi-concept generation performance. We further design a minimal, low-cost solution that overcomes these issues by tweaking (not re-training) the text embeddings for more realistic multi-concept text-to-image generation. Our Correction by Similarities method tweaks the embedding of each concept by collecting semantic features from the most similar tokens to localize the contribution. To avoid mixing features of different concepts, we also apply Cross-Token Non-Maximum Suppression, which excludes the overlap of contributions from different concepts. Experiments show that our approach outperforms previous methods in text-to-image, image manipulation, and personalization tasks, despite not introducing additional training or inference costs to the diffusion steps.
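The two components named above can be illustrated with a small sketch. The abstract does not give formulas, so everything concrete here is an assumption made for illustration: cosine similarity as the similarity measure, a `top_k` cutoff, a similarity-weighted average as the way features are "collected", a fixed list of candidate token embeddings, and the hypothetical function names `correct_by_similarities` and `cross_token_nms`. It should not be read as the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cross_token_nms(scores):
    """Assumed form of cross-token suppression.

    scores[c][t] is the similarity of concept c to candidate token t.
    Each token keeps its score only for the concept where that score is
    highest; it is zeroed for every other concept, so no token
    contributes to two concepts at once.
    """
    n_tokens = len(scores[0])
    kept = [[0.0] * n_tokens for _ in scores]
    for t in range(n_tokens):
        winner = max(range(len(scores)), key=lambda c: scores[c][t])
        kept[winner][t] = scores[winner][t]
    return kept


def correct_by_similarities(concepts, candidates, top_k=2):
    """Assumed form of the correction step.

    Each concept embedding is replaced by a similarity-weighted average
    of its top_k most similar candidate tokens, after suppression.
    A concept with no positively scored token is left unchanged.
    """
    scores = [[cosine(c, v) for v in candidates] for c in concepts]
    scores = cross_token_nms(scores)
    corrected = []
    for concept, row in zip(concepts, scores):
        ranked = sorted(range(len(row)), key=lambda t: row[t], reverse=True)
        top = [t for t in ranked[:top_k] if row[t] > 0]
        total = sum(row[t] for t in top)
        if total == 0:
            corrected.append(list(concept))
            continue
        corrected.append([
            sum(row[t] * candidates[t][d] for t in top) / total
            for d in range(len(concept))
        ])
    return corrected


# Toy 2-D example: two concepts whose embeddings leak into each other's axis.
concepts = [[1.0, 0.2], [0.2, 1.0]]
candidates = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
corrected = correct_by_similarities(concepts, candidates)
```

In this toy setup each corrected embedding is built only from the tokens assigned to its own concept, so the leaked component shrinks. The operation touches only the text embeddings, which is consistent with the claim that no cost is added to the diffusion steps.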