Text-conditioned image generation models often generate incorrect associations between entities and their visual attributes. This reflects an impaired mapping between linguistic binding of entities and modifiers in the prompt and visual binding of the corresponding elements in the generated image. As one notable example, a query like "a pink sunflower and a yellow flamingo" may incorrectly produce an image of a yellow sunflower and a pink flamingo. To remedy this issue, we propose SynGen, an approach that first syntactically analyses the prompt to identify entities and their modifiers, and then uses a novel loss function that encourages the cross-attention maps to agree with the linguistic binding reflected by the syntax. Specifically, we encourage large overlap between the attention maps of entities and their modifiers, and small overlap with the maps of other entities and modifier words. The loss is optimized during inference, without retraining or fine-tuning the model. Human evaluation on three datasets, including one new and challenging set, demonstrates significant improvements of SynGen over current state-of-the-art methods. This work highlights how making use of sentence structure during inference can efficiently and substantially improve the faithfulness of text-to-image generation.
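The overlap objective described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how overlap is measured, so the symmetric KL divergence between normalized attention maps is an assumption here, and the names `normalize`, `symmetric_kl`, and `binding_loss` are hypothetical. The (modifier, entity) pairs are assumed to come from a syntactic parse of the prompt and are passed in directly.

```python
import numpy as np


def normalize(attn_map, eps=1e-8):
    """Flatten an attention map and scale it so that it sums to 1."""
    flat = np.asarray(attn_map, dtype=np.float64).ravel() + eps
    return flat / flat.sum()


def symmetric_kl(p, q):
    """Symmetric KL divergence (assumed overlap measure); 0 for identical maps."""
    return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))


def binding_loss(attn_maps, pairs):
    """Hypothetical loss: low when bound words overlap and unbound words do not.

    attn_maps: dict mapping each prompt word to a 2D cross-attention map.
    pairs: list of (modifier, entity) tuples, assumed to come from a parser.
    """
    dists = {word: normalize(m) for word, m in attn_maps.items()}
    loss = 0.0
    for modifier, entity in pairs:
        # Positive term: a modifier and its entity should attend to the same region.
        loss += symmetric_kl(dists[modifier], dists[entity])
        # Negative term: both should be far from every other entity or modifier.
        others = [w for w in dists if w not in (modifier, entity)]
        if others:
            loss -= np.mean([
                symmetric_kl(dists[w], dists[o])
                for w in (modifier, entity)
                for o in others
            ])
    return loss


# Toy maps: "pink" and "sunflower" attend left, "yellow" and "flamingo" attend right.
left = np.array([[1.0, 0.0], [1.0, 0.0]])
right = np.array([[0.0, 1.0], [0.0, 1.0]])
maps = {"pink": left, "sunflower": left, "yellow": right, "flamingo": right}

correct = binding_loss(maps, [("pink", "sunflower"), ("yellow", "flamingo")])
swapped = binding_loss(maps, [("pink", "flamingo"), ("yellow", "sunflower")])
print(correct < swapped)
```

In this toy setup the attention maps agree with the intended binding, so the loss for the correct pairs is lower than for the swapped pairs. In the approach described above, such a loss would be minimized during inference rather than used to retrain or fine-tune the model.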