Recent advances in diffusion models have enabled high-quality text-to-image synthesis. However, text input alone is spatially ambiguous and offers limited user control. Most existing methods provide spatial control through additional visual guidance (e.g., sketches and semantic masks) but require additional training with annotated images. In this paper, we propose a method for spatially controlling text-to-image generation without further training of diffusion models. Our method is based on the insight that cross-attention maps reflect the positional relationship between words and pixels, and our aim is to control these maps according to given semantic masks and text prompts. To this end, we first explore a simple approach that directly swaps the cross-attention maps with constant maps computed from the semantic regions. Some prior works likewise achieve training-free spatial control of text-to-image diffusion models by directly manipulating cross-attention maps. However, such direct approaches still suffer from misalignment with the given masks because the manipulated attention maps are far from the actual ones learned by diffusion models. To address this issue, we propose masked-attention guidance, which generates images more faithful to semantic masks by indirectly controlling the attention between each word and pixel through manipulation of the noise images fed to diffusion models. Masked-attention guidance can be easily integrated into pre-trained off-the-shelf diffusion models (e.g., Stable Diffusion) and applied to text-guided image editing. Experiments show that our method enables more accurate spatial control than baselines, both qualitatively and quantitatively.
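To make the two ideas above concrete, the following is a minimal, dependency-free sketch rather than the paper's implementation. `constant_attention_maps` illustrates the simple swapping approach: it builds a constant cross-attention map in which every pixel attends only to the words assigned to its semantic region. `masked_attention_loss` is a hypothetical stand-in for the kind of objective a guidance step might minimize, namely the share of each word's attention that falls outside its mask. The function names, the flattened-list data layout, the uniform fallback for unlabelled pixels, and the exact form of the loss are all assumptions made for illustration.

```python
def constant_attention_maps(masks, token_regions, num_tokens):
    """Build a constant cross-attention map from semantic masks.

    masks: list of flattened binary masks, one list of 0/1 values per region.
    token_regions: dict mapping a token index to the index of its region.
    Returns one row per pixel; each row is a distribution over tokens.
    """
    num_pixels = len(masks[0])
    attention = []
    for p in range(num_pixels):
        # Tokens whose semantic region covers this pixel.
        active = [t for t, r in token_regions.items() if masks[r][p]]
        if not active:
            # Assumption: pixels outside every region attend uniformly
            # to all tokens.
            active = list(range(num_tokens))
        weight = 1.0 / len(active)
        attention.append(
            [weight if t in active else 0.0 for t in range(num_tokens)]
        )
    return attention


def masked_attention_loss(attention, masks, token_regions):
    """Hypothetical loss: attention mass each token places outside its region."""
    loss = 0.0
    for t, r in token_regions.items():
        total = sum(row[t] for row in attention)
        if total > 0:
            inside = sum(row[t] for row, m in zip(attention, masks[r]) if m)
            loss += 1.0 - inside / total
    return loss


# A 2x2 image flattened to 4 pixels: region 0 is the left column,
# region 1 is the right column. Token 0 has no region (e.g., an article).
masks = [[1, 0, 1, 0], [0, 1, 0, 1]]
token_regions = {1: 0, 2: 1}

target = constant_attention_maps(masks, token_regions, num_tokens=3)
uniform = [[1.0 / 3.0] * 3 for _ in range(4)]

print(masked_attention_loss(target, masks, token_regions))
print(masked_attention_loss(uniform, masks, token_regions))
```

In this toy setting the constant map is perfectly aligned with the masks, while a uniform attention map is not, so the loss separates the two. An indirect scheme such as the one described above would not overwrite the attention maps with the constant target. Instead, it would differentiate a loss of this kind with respect to the noise image and update the noise, which requires an automatic-differentiation framework and is not shown here.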