Recent text-to-image (T2I) diffusion models generate high-quality images conditioned on textual prompts. However, these models often fail to semantically align the generated images with the text descriptions because of their limited compositional capabilities, which leads to attribute leakage, entity leakage, and missing entities. In this paper, we propose a novel attention mask control strategy based on predicted object boxes to address these three issues. In particular, we first train a BoxNet to predict a box for each entity that possesses the attribute specified in the prompt. Then, depending on the predicted boxes, a distinct mask control is applied to the cross- and self-attention maps. By constraining the attention region of each prompt token in the image, our approach produces a more semantically accurate synthesis. In addition, the proposed method is straightforward and effective, and can be readily integrated into existing cross-attention-based T2I diffusion generators. We compare our approach with competing methods and demonstrate that it not only faithfully conveys the semantics of the original text to the generated content, but also works as a ready-to-use plugin.
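To make the idea of box-based attention masking concrete, the following is a minimal sketch and not the authors' implementation. It assumes cross-attention maps stored as an array of shape (H, W, T) for T prompt tokens, and boxes given as normalized (x0, y0, x1, y1) coordinates; the function names `box_to_mask` and `mask_cross_attention` and the hard binary masking rule are illustrative assumptions. BoxNet itself and the self-attention masking are not shown.

```python
import numpy as np


def box_to_mask(box, height, width):
    """Binary mask on an H x W grid for a normalized (x0, y0, x1, y1) box."""
    x0, y0, x1, y1 = box
    mask = np.zeros((height, width), dtype=np.float32)
    # Round outward so the box covers every grid cell it touches.
    r0, r1 = int(np.floor(y0 * height)), int(np.ceil(y1 * height))
    c0, c1 = int(np.floor(x0 * width)), int(np.ceil(x1 * width))
    mask[r0:r1, c0:c1] = 1.0
    return mask


def mask_cross_attention(attn, token_boxes):
    """Zero each token's attention outside its box (hypothetical rule).

    attn: array of shape (H, W, T), one map per prompt token.
    token_boxes: dict mapping a token index to its predicted box.
    Tokens without a box are left unchanged.
    """
    height, width, _ = attn.shape
    out = attn.copy()
    for token_index, box in token_boxes.items():
        out[:, :, token_index] *= box_to_mask(box, height, width)
    return out


# Toy example: uniform attention over a 4 x 4 grid for 3 tokens.
attn = np.ones((4, 4, 3))
boxes = {1: (0.0, 0.0, 0.5, 0.5), 2: (0.5, 0.5, 1.0, 1.0)}
masked = mask_cross_attention(attn, boxes)
```

In this toy example, token 1 keeps attention only in the top-left quadrant and token 2 only in the bottom-right quadrant, while token 0, which has no box, is untouched.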