Large-scale diffusion models have achieved state-of-the-art results on text-to-image (T2I) synthesis tasks. Although they generate high-quality and creative images, attribute binding and compositional capabilities remain major challenges, especially when a prompt involves multiple objects. In this work, we improve the compositional skills of T2I models, specifically by achieving more accurate attribute binding and better image compositions. To do this, we incorporate linguistic structures into the diffusion guidance process, building on the controllable properties of cross-attention layers in diffusion-based T2I models. We observe that keys and values in cross-attention layers carry strong semantic meanings associated with object layouts and content. We can therefore better preserve the compositional semantics in the generated image by manipulating the cross-attention representations based on linguistic insights. Built upon Stable Diffusion, a state-of-the-art T2I model, our structured cross-attention design is efficient and requires no additional training samples. We achieve better compositional skills in both qualitative and quantitative results, leading to a 5-8% advantage in head-to-head user comparison studies. Lastly, we conduct an in-depth analysis to reveal potential causes of incorrect image compositions and justify the properties of cross-attention layers in the generation process.
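The abstract does not spell out how the cross-attention representations are manipulated, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that the attention map is computed once from the keys of the full prompt (which would fix the layout), and that several alternative value sequences, for example one per noun phrase obtained from a parser, are each attended with that same map and then averaged (which would fix the content). The function names `cross_attention_map` and `structured_cross_attention` and the averaging rule are hypothetical choices made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def cross_attention_map(queries, keys):
    """Scaled dot-product attention weights: softmax(Q K^T / sqrt(d))."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    scores = matmul(queries, transpose(keys))
    return [softmax([s * scale for s in row]) for row in scores]


def structured_cross_attention(queries, keys, values_list):
    """Hypothetical combination rule (an assumption, not from the source):
    reuse one attention map for every value sequence and average the results."""
    attn = cross_attention_map(queries, keys)
    outputs = [matmul(attn, values) for values in values_list]
    n = len(outputs)
    rows, cols = len(outputs[0]), len(outputs[0][0])
    return [[sum(o[i][j] for o in outputs) / n for j in range(cols)]
            for i in range(rows)]


# Toy example: 2 image-patch queries, 3 text tokens, feature size 2.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values_full = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # full-prompt encoding
values_phrase = [[0.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # hypothetical phrase encoding
mixed = structured_cross_attention(queries, keys, [values_full, values_phrase])
```

With a single value sequence the function reduces to ordinary cross-attention, so under these assumptions the structured variant changes only which text features are mixed in, not where they are placed, and it needs no extra training.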