Emerging large-scale text-to-image generative models, e.g., Stable Diffusion (SD), have produced impressive, high-fidelity results. Despite this progress, current state-of-the-art models still struggle to generate images that fully adhere to the input prompt. Prior work, Attend & Excite, introduced the concept of Generative Semantic Nursing (GSN), which optimizes cross-attention at inference time to better incorporate the semantics of the prompt. It demonstrates promising results on simple prompts, e.g., ``a cat and a dog''. However, its efficacy declines on more complex prompts, and it does not explicitly address the problem of improper attribute binding. To address the challenges posed by complex prompts or scenarios involving multiple entities, and to achieve improved attribute binding, we propose Divide & Bind. We introduce two novel loss objectives for GSN: an attendance loss and a binding loss. Our approach faithfully synthesizes the desired objects with improved attribute alignment from complex prompts and exhibits superior performance across multiple evaluation benchmarks. More videos and updates can be found on the project page \url{https://sites.google.com/view/divide-and-bind}.