Recently, text-to-image diffusion models have demonstrated an impressive ability to generate high-quality images conditioned on textual input. However, these models struggle to follow textual instructions that specify spatial layout. While previous research has primarily focused on aligning cross-attention maps with layout conditions, it overlooks the impact of the initialization noise on layout guidance. To achieve better layout control, we propose using a spatial-aware initialization noise in the denoising process. Specifically, we find that a reference image inverted with a finite number of inversion steps retains valuable spatial information about the object's position, so that images generated from it exhibit similar layouts. Based on this observation, we develop an open-vocabulary framework that customizes a spatial-aware initialization noise for each layout condition. Because it modifies no module other than the initialization noise, our approach can be seamlessly integrated as a plug-and-play module into other training-free layout guidance frameworks. We evaluate our approach quantitatively and qualitatively on the publicly available Stable Diffusion model and the COCO dataset. Equipped with the spatial-aware latent initialization, our method significantly improves the effectiveness of layout guidance while preserving high-quality content.
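To make the idea of inverting a reference image into a layout-aware starting latent more concrete, the following is a minimal sketch of deterministic DDIM inversion on a toy latent. It is not the framework described above. The function names, the linear beta schedule, the box-shaped reference latent, and the zero noise predictor are all illustrative assumptions; a real pipeline would plug in the noise prediction of the diffusion model's denoising network and a reference image built for the given layout.

```python
import numpy as np

def make_alpha_bars(num_train_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal coefficients for a linear beta schedule (an assumption)."""
    betas = np.linspace(beta_start, beta_end, num_train_steps)
    return np.cumprod(1.0 - betas)

def reference_latent(shape, box, value=1.0):
    """Hypothetical reference: an empty latent with a filled box (x0, y0, x1, y1)."""
    height, width = shape
    x0, y0, x1, y1 = box
    latent = np.zeros((height, width))
    latent[y0:y1, x0:x1] = value
    return latent

def ddim_invert(latent, eps_model, alpha_bars, timesteps):
    """Deterministic DDIM inversion along an increasing sequence of timesteps.

    eps_model(x, t) is a hypothetical callable standing in for the
    denoising network's noise prediction.
    """
    x = latent.copy()
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        a_cur, a_next = alpha_bars[t_cur], alpha_bars[t_next]
        eps = eps_model(x, t_cur)
        # Estimate the clean latent implied by the current state.
        x0_pred = (x - np.sqrt(1.0 - a_cur) * eps) / np.sqrt(a_cur)
        # Re-noise that estimate to the next (noisier) timestep.
        x = np.sqrt(a_next) * x0_pred + np.sqrt(1.0 - a_next) * eps
    return x

# Usage with a placeholder predictor that always returns zero noise,
# so the example runs without any trained model.
alpha_bars = make_alpha_bars()
num_inversion_steps = 10  # a small, finite number of inversion steps
timesteps = np.linspace(0, len(alpha_bars) - 1, num_inversion_steps + 1).astype(int)
reference = reference_latent((8, 8), box=(2, 2, 6, 6))
init_latent = ddim_invert(reference, lambda x, t: np.zeros_like(x), alpha_bars, timesteps)
```

With the zero-noise placeholder, the inverted latent is simply a rescaled copy of the reference, so the box position is trivially preserved. With a real noise predictor the result would look much closer to Gaussian noise, and how much layout information survives is the empirical question the abstract addresses.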