Large-scale diffusion models have achieved state-of-the-art results on text-to-image (T2I) synthesis tasks. Despite their ability to generate high-quality and creative images, we observe that attribute binding and compositional capabilities remain major challenges, especially when multiple objects are involved. In this work, we improve the compositional skills of T2I models, specifically achieving more accurate attribute binding and better image compositions. To do this, we incorporate linguistic structures into the diffusion guidance process, building on the controllable properties of cross-attention layers in diffusion-based T2I models. We observe that keys and values in cross-attention layers carry strong semantic meanings associated with object layouts and content. Therefore, we can better preserve the compositional semantics in the generated image by manipulating the cross-attention representations based on linguistic insights. Built upon Stable Diffusion, a state-of-the-art T2I model, our structured cross-attention design is efficient and requires no additional training samples. We achieve better compositional skills in both qualitative and quantitative results, leading to a 5-8% advantage in head-to-head user comparison studies. Lastly, we conduct an in-depth analysis to reveal potential causes of incorrect image compositions and justify the properties of cross-attention layers in the generation process.
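To make the role of keys and values concrete, the following is a minimal NumPy sketch and not the implementation used in this work. It assumes a single-head cross-attention layer in which image features act as queries and text-token embeddings act as keys and values: the attention map computed from queries and keys decides which token each spatial location attends to (layout), while the values decide what is written there (content). The function names, the toy dimensions, the random stand-in embeddings, and the choice to average the outputs over several value matrices are all illustrative assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_map(queries, keys):
    # queries: (n_pixels, d) image features; keys: (n_tokens, d) text features.
    # Returns (n_pixels, n_tokens): how strongly each location attends to each token.
    d = queries.shape[-1]
    return softmax(queries @ keys.T / np.sqrt(d), axis=-1)


def structured_cross_attention(queries, keys, value_sets):
    # Assumption for illustration: reuse ONE attention map (layout) and apply it
    # to several value matrices (content), then average the resulting outputs.
    attn = attention_map(queries, keys)
    outputs = [attn @ values for values in value_sets]
    return np.mean(outputs, axis=0)


# Toy data: random stand-ins for real image and text-encoder features.
rng = np.random.default_rng(0)
n_pixels, n_tokens, d = 16, 6, 8
queries = rng.normal(size=(n_pixels, d))
keys = rng.normal(size=(n_tokens, d))
values_full = rng.normal(size=(n_tokens, d))

# Hypothetical variant: tokens 1-2 stand for a noun phrase that was encoded
# separately, so only those rows of the value matrix differ.
values_phrase = values_full.copy()
values_phrase[1:3] = rng.normal(size=(2, d))

out = structured_cross_attention(queries, keys, [values_full, values_phrase])
print(out.shape)
```

With a single value matrix in `value_sets`, the function reduces to ordinary cross-attention, so the sketch only changes the content pathway and leaves the attention map untouched.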