Text-conditioned image generation models often generate incorrect associations between entities and their visual attributes. This reflects an impaired mapping between linguistic binding of entities and modifiers in the prompt and visual binding of the corresponding elements in the generated image. As one notable example, a query like "a pink sunflower and a yellow flamingo" may incorrectly produce an image of a yellow sunflower and a pink flamingo. To remedy this issue, we propose SynGen, an approach that first syntactically analyses the prompt to identify entities and their modifiers, and then uses a novel loss function that encourages the cross-attention maps to agree with the linguistic binding reflected by the syntax. Specifically, we encourage large overlap between the attention maps of entities and their modifiers, and small overlap with the maps of other entities and modifier words. The loss is optimized during inference, without retraining or fine-tuning the model. Human evaluation on three datasets, including one new and challenging set, demonstrates significant improvements of SynGen compared with current state-of-the-art methods. This work highlights how making use of sentence structure during inference can efficiently and substantially improve the faithfulness of text-to-image generation.
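The overlap objective described above can be illustrated with a minimal sketch. The abstract does not state how overlap is measured, so this example assumes a symmetric KL divergence between normalized attention maps (lower divergence means more overlap). The function names `normalize`, `sym_kl`, and `binding_loss`, the toy 4x4 maps, and the hand-written modifier-entity pairs are all hypothetical and stand in for the output of a syntactic parser and a diffusion model's cross-attention layers.

```python
import numpy as np

def normalize(attn, eps=1e-8):
    """Flatten an attention map and scale it to sum to 1 (eps avoids log(0))."""
    flat = np.asarray(attn, dtype=np.float64).ravel() + eps
    return flat / flat.sum()

def sym_kl(p, q):
    """Symmetric KL divergence; an assumed stand-in for an overlap measure."""
    return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def binding_loss(attn_maps, pairs):
    """attn_maps: dict token -> 2D map; pairs: list of (modifier, entity)."""
    dists = {tok: normalize(m) for tok, m in attn_maps.items()}
    loss = 0.0
    for modifier, entity in pairs:
        # Positive term: pull the modifier's map toward its entity's map.
        loss += sym_kl(dists[modifier], dists[entity])
        # Negative term: push the pair away from all unrelated tokens.
        others = [t for t in dists if t not in (modifier, entity)]
        if others:
            loss -= np.mean([sym_kl(dists[w], dists[o])
                             for w in (modifier, entity) for o in others])
    return loss

# Toy maps: one attends to the left half of the image, one to the right half.
left = np.zeros((4, 4))
left[:, :2] = 1.0
right = np.zeros((4, 4))
right[:, 2:] = 1.0

# Pairs a parser might extract from "a pink sunflower and a yellow flamingo".
pairs = [("pink", "sunflower"), ("yellow", "flamingo")]

# Correct binding: each modifier attends where its entity does.
correct = binding_loss(
    {"pink": left, "sunflower": left, "yellow": right, "flamingo": right}, pairs)
# Swapped binding: each modifier attends where the other entity does.
swapped = binding_loss(
    {"pink": right, "sunflower": left, "yellow": left, "flamingo": right}, pairs)

print(correct < swapped)
```

In this toy setting the correctly bound maps receive a lower loss than the swapped ones, which is the direction an inference-time optimizer would follow. The sketch only evaluates the loss on fixed maps; how the loss is minimized during generation is not shown here.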