Recent diffusion-based generators can produce high-quality images from textual prompts alone. However, they do not correctly interpret instructions that specify the spatial layout of the composition. We propose a simple approach that achieves robust layout control without training or fine-tuning the image generator. Our technique, which we call layout guidance, manipulates the cross-attention layers that the model uses to interface textual and visual information and steers the reconstruction toward a desired target, e.g., a user-specified layout. To determine how best to guide attention, we study the role of different attention maps when generating images and experiment with two alternative strategies, forward and backward guidance. We evaluate our method quantitatively and qualitatively in several experiments, validating its effectiveness. We further demonstrate its versatility by extending layout guidance to the task of editing the layout and context of a given real image.
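As a rough illustration of what steering cross-attention toward a user-specified layout could mean in practice, the sketch below scores how much of one token's attention mass falls inside a target bounding box. The function name `layout_energy`, the half-open box convention, and the squared penalty are assumptions made for this example; the abstract does not specify an objective. A guidance scheme might lower such a score, for instance by taking gradient steps on the latent, but that step is not shown here.

```python
def layout_energy(attn, box):
    """Score how poorly one token's attention map matches a target box.

    attn: 2D list of non-negative attention weights (rows x columns).
    box:  (top, left, bottom, right) in map coordinates, half-open,
          so rows top..bottom-1 and columns left..right-1 are inside.
    Returns 0.0 when all attention lies inside the box and 1.0 when
    none does (hypothetical objective, chosen for illustration only).
    """
    top, left, bottom, right = box
    total = sum(sum(row) for row in attn)
    if total == 0:
        # No attention anywhere: treat as the worst case.
        return 1.0
    inside = sum(sum(row[left:right]) for row in attn[top:bottom])
    return (1.0 - inside / total) ** 2
```

For example, a uniform 2x2 map scored against a box covering only its top row has half of its mass inside the box, so `layout_energy([[1, 1], [1, 1]], (0, 0, 1, 2))` evaluates to `0.25`.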