Driven by scalable diffusion models trained on large-scale paired text-image datasets, text-to-image synthesis methods have shown compelling results. However, these models still fail to precisely follow the text prompt when it involves multiple objects, attributes, and spatial compositions. In this paper, we identify potential causes of this failure in both the cross-attention and self-attention layers of the diffusion model. We propose two novel losses that refocus the attention maps according to a given layout during the sampling process. We perform comprehensive experiments on the DrawBench and HRS benchmarks using layouts synthesized by large language models, showing that our proposed losses can be integrated easily and effectively into existing text-to-image methods and consistently improve the alignment between the generated images and the text prompts.
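To make the idea of refocusing attention according to a layout more concrete, the following is a minimal toy sketch, not the losses proposed in this paper. It assumes a single token's cross-attention map is given as a 2D grid of non-negative weights and that the layout assigns that token a box in normalized coordinates. The function names `box_mask` and `outside_box_loss` are hypothetical and chosen only for illustration. The sketch measures how much attention mass falls outside the box; in a real sampler, a quantity of this kind would be differentiated with respect to the latent, which is omitted here.

```python
def box_mask(height, width, box):
    """Binary mask for a box (x0, y0, x1, y1) given in normalized [0, 1] coordinates.

    Hypothetical helper: the layout format is an assumption, not taken from the paper.
    """
    x0, y0, x1, y1 = box
    r0, r1 = round(y0 * height), round(y1 * height)
    c0, c1 = round(x0 * width), round(x1 * width)
    return [
        [1.0 if (r0 <= r < r1 and c0 <= c < c1) else 0.0 for c in range(width)]
        for r in range(height)
    ]


def outside_box_loss(attn, mask, eps=1e-8):
    """Fraction of the attention mass that lies outside the token's box.

    Illustrative penalty only: close to 0 when attention is concentrated
    inside the box, close to 1 when it lies entirely outside.
    """
    total = sum(sum(row) for row in attn) + eps
    inside = sum(
        a * m
        for attn_row, mask_row in zip(attn, mask)
        for a, m in zip(attn_row, mask_row)
    )
    return 1.0 - inside / total


# Usage: uniform attention on a 4x4 grid, with the box covering the left half.
uniform_attn = [[1.0] * 4 for _ in range(4)]
left_half = box_mask(4, 4, (0.0, 0.0, 0.5, 1.0))
loss_uniform = outside_box_loss(uniform_attn, left_half)

# Attention already concentrated inside the box gives a loss near zero.
focused_attn = [[1.0, 1.0, 0.0, 0.0] for _ in range(4)]
loss_focused = outside_box_loss(focused_attn, left_half)
```

In this toy setting, half of the uniform attention lies outside the box, so the penalty is about one half, while the focused map incurs almost no penalty. How the actual losses treat cross-attention and self-attention layers is specified in the paper's method section, not here.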