Recent diffusion-based generators can produce high-quality images from textual prompts. However, they often disregard textual instructions that specify the spatial layout of the composition. We propose a simple approach that achieves robust layout control without training or fine-tuning the image generator. Our technique manipulates the cross-attention layers that the model uses to interface textual and visual information, steering generation toward a target such as a user-specified layout. To determine how best to guide attention, we study the role of attention maps and explore two alternative strategies, forward and backward guidance. We thoroughly evaluate our approach on three benchmarks and provide several qualitative examples and a comparative analysis of the two strategies, which show that backward guidance outperforms both forward guidance and prior work. We further demonstrate the versatility of layout guidance by extending it to applications such as editing the layout and context of real images.
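To make the idea of backward guidance concrete, the following is a minimal toy sketch, not the method as implemented in this work. It assumes an energy of the form (1 - fraction of attention mass inside the target box)^2, which the abstract does not state, and it applies gradient descent directly to a small attention map, whereas a real system would presumably backpropagate such an energy to the latent of the diffusion model. All function names (`box_mask`, `layout_energy`, `energy_grad`, `guide`) are hypothetical.

```python
# Toy illustration only. Assumptions: the energy form below and the direct
# optimization of the attention map are stand-ins chosen for clarity; a real
# diffusion model would update its latent via backpropagation instead.

def box_mask(height, width, box):
    """Binary mask for a box given as (top, left, bottom, right), end-exclusive."""
    top, left, bottom, right = box
    return [[1.0 if top <= r < bottom and left <= c < right else 0.0
             for c in range(width)]
            for r in range(height)]

def inside_fraction(attn, mask):
    """Fraction of the total attention mass that falls inside the mask."""
    total = sum(sum(row) for row in attn)
    inside = sum(a * m for ra, rm in zip(attn, mask) for a, m in zip(ra, rm))
    return inside / total, total

def layout_energy(attn, mask):
    """Assumed energy: (1 - fraction of attention inside the box) squared."""
    frac, _ = inside_fraction(attn, mask)
    return (1.0 - frac) ** 2

def energy_grad(attn, mask):
    """Analytic gradient of layout_energy with respect to each attention value."""
    frac, total = inside_fraction(attn, mask)
    scale = -2.0 * (1.0 - frac) / total
    return [[scale * (m - frac) for m in row] for row in mask]

def guide(attn, mask, steps=50, step_size=5.0):
    """Gradient descent on the energy, keeping attention values positive."""
    for _ in range(steps):
        grad = energy_grad(attn, mask)
        attn = [[max(a - step_size * g, 1e-8) for a, g in zip(ra, rg)]
                for ra, rg in zip(attn, grad)]
    return attn

if __name__ == "__main__":
    attn = [[1.0] * 4 for _ in range(4)]      # uniform 4x4 attention map
    mask = box_mask(4, 4, (0, 0, 2, 2))       # target: top-left 2x2 box
    guided = guide(attn, mask)
    print(layout_energy(attn, mask), layout_energy(guided, mask))
```

In this sketch each step raises attention inside the box and lowers it outside, so the energy decreases as attention concentrates in the requested region. Forward guidance, by contrast, would overwrite or bias the attention map directly rather than follow a gradient.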