Despite significant progress in Text-to-Image (T2I) generative models, even lengthy and complex text descriptions struggle to convey fine-grained control. In contrast, Layout-to-Image (L2I) generation, which aims to generate realistic and complex scene images from user-specified layouts, has risen to prominence. However, existing methods transform layout information into tokens or RGB images for conditional control in the generative process, which leaves the spatial and semantic controllability of individual instances insufficient. To address these limitations, we propose a novel Spatial-Semantic Map Guided (SSMG) diffusion model that uses a feature map derived from the layout as guidance. Owing to the rich spatial and semantic information encapsulated in these well-designed feature maps, SSMG achieves superior generation quality with sufficient spatial and semantic controllability compared to previous works. Additionally, we propose the Relation-Sensitive Attention (RSA) and Location-Sensitive Attention (LSA) mechanisms. The former models the relationships among multiple objects within a scene, while the latter heightens the model's sensitivity to the spatial information embedded in the guidance. Extensive experiments demonstrate that SSMG achieves highly promising results, setting a new state of the art across a range of metrics encompassing fidelity, diversity, and controllability.
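To make the idea of a layout-derived feature map concrete, the following is a minimal sketch and not the construction used by SSMG, which the abstract does not specify. It assumes normalized bounding boxes `(x0, y0, x1, y1)`, one hypothetical embedding vector per instance, and simple averaging where boxes overlap; the function name `layout_to_feature_map` and the grid size are illustrative assumptions.

```python
import math


def layout_to_feature_map(boxes, embeddings, height=8, width=8):
    """Rasterize boxes into a height x width x dim grid of feature vectors.

    Assumption (not from the paper): each cell covered by a box receives that
    instance's embedding, and cells covered by several boxes hold the mean.
    """
    dim = len(embeddings[0])
    sums = [[[0.0] * dim for _ in range(width)] for _ in range(height)]
    counts = [[0] * width for _ in range(height)]

    for (x0, y0, x1, y1), emb in zip(boxes, embeddings):
        # Convert normalized coordinates to half-open cell ranges.
        r0, r1 = math.floor(y0 * height), math.ceil(y1 * height)
        c0, c1 = math.floor(x0 * width), math.ceil(x1 * width)
        for r in range(r0, r1):
            for c in range(c0, c1):
                counts[r][c] += 1
                for d in range(dim):
                    sums[r][c][d] += emb[d]

    # Average overlapping instances; uncovered cells stay at zero.
    return [
        [[v / max(counts[r][c], 1) for v in sums[r][c]] for c in range(width)]
        for r in range(height)
    ]


# Two overlapping instances with toy 2-dimensional embeddings.
boxes = [(0.0, 0.0, 0.5, 0.5), (0.25, 0.25, 1.0, 1.0)]
embeddings = [[1.0, 0.0], [0.0, 1.0]]
fmap = layout_to_feature_map(boxes, embeddings)
```

Unlike a token sequence or a flat RGB rendering, a map of this kind keeps each instance's position and a per-cell semantic vector in the same tensor, which is the property the abstract attributes to feature-map guidance.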