Text-to-image synthesis has achieved high-quality results with recent advances in diffusion models. However, text input alone is spatially ambiguous and gives users limited control over layout. Most existing methods provide spatial control through additional visual guidance (e.g., sketches and semantic masks) but require additional training on annotated images. In this paper, we propose a method for spatially controlling text-to-image generation without further training of diffusion models. Our method is based on the insight that cross-attention maps reflect the positional relationship between words and pixels. Our aim is to control the attention maps according to given semantic masks and text prompts. To this end, we first explore a simple approach that directly swaps the cross-attention maps with constant maps computed from the semantic regions. We then propose masked-attention guidance, which generates images more faithful to the semantic masks than the first approach. Masked-attention guidance indirectly controls the attention between each word and pixel according to the semantic regions by manipulating the noise images fed to diffusion models. Experiments show that our method enables more accurate spatial control than baselines, both qualitatively and quantitatively.
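As a rough illustration of the first approach described above, the following NumPy sketch builds constant attention maps from binary semantic masks. It is not the authors' implementation: the function name `constant_attention_maps`, the `token_to_region` mapping, the choice to let tokens without a region attend everywhere, and the uniform fallback for uncovered pixels are all assumptions made for this example.

```python
import numpy as np


def constant_attention_maps(masks, token_to_region, num_tokens):
    """Build per-pixel attention weights over tokens from region masks.

    masks: (R, H, W) array of 0/1 semantic region masks.
    token_to_region: dict mapping a token index to a region index.
    Returns an (H*W, num_tokens) array whose rows sum to 1.
    """
    masks = np.asarray(masks, dtype=np.float64)
    num_regions, height, width = masks.shape
    flat = masks.reshape(num_regions, height * width)

    # Assumption: tokens with no assigned region may attend to any pixel.
    target = np.ones((height * width, num_tokens))
    for token, region in token_to_region.items():
        target[:, token] = flat[region]

    # Normalize over tokens so each pixel's weights sum to 1, which
    # matches the softmax axis of a cross-attention map.
    row_sums = target.sum(axis=1, keepdims=True)
    # Assumption: pixels covered by no token fall back to uniform weights.
    uniform = np.full_like(target, 1.0 / num_tokens)
    return np.where(row_sums > 0, target / np.maximum(row_sums, 1e-8), uniform)


# Hypothetical 2x2 layout: left column is region 0, right column is region 1.
masks = np.array([[[1, 0], [1, 0]],
                  [[0, 1], [0, 1]]])
# Token 0 has no region; token 1 maps to region 0, token 2 to region 1.
maps = constant_attention_maps(masks, {1: 0, 2: 1}, num_tokens=3)
```

In a real pipeline, maps like these would replace the softmax output of each cross-attention layer at the matching resolution. Masked-attention guidance, by contrast, leaves the attention layers untouched and updates the noise image instead, which this sketch does not cover.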