Emerging large-scale text-to-image generative models, e.g., Stable Diffusion (SD), produce high-fidelity images. Despite this progress, current state-of-the-art models still struggle to generate images that fully adhere to the input prompt. Prior work, Attend & Excite, introduced the concept of Generative Semantic Nursing (GSN), which optimizes cross-attention at inference time to better incorporate the semantics of the prompt. It shows promising results on simple prompts, e.g., ``a cat and a dog''. However, its efficacy declines on more complex prompts, and it does not explicitly address improper attribute binding. To handle complex prompts or scenarios involving multiple entities, and to improve attribute binding, we propose Divide & Bind. We introduce two novel loss objectives for GSN: an attendance loss and a binding loss. Our approach faithfully synthesizes the desired objects from complex prompts with improved attribute alignment, and it exhibits superior performance across multiple evaluation benchmarks.