Layout-to-image synthesis is an emerging technique in conditional image generation. It aims to generate complex scenes in which users have fine-grained control over the layout of the objects. However, controlling object coherence remains challenging, including semantic coherence (e.g., whether or not the cat looks at the flowers) and physical coherence (e.g., the hand and the racket should not be misaligned). In this paper, we propose a novel diffusion model with effective global semantic fusion (GSF) and self-similarity feature enhancement modules to guide object coherence for this task. For semantic coherence, we argue that the image caption contains rich information for defining the semantic relationships among the objects in an image. Simply employing cross-attention between captions and generated images addresses the highly relevant layout restriction and semantic coherence separately and, as our experiments show, leads to unsatisfying results. Instead, we develop GSF to fuse the supervision from the layout restriction and the semantic coherence requirement and exploit it to guide the image synthesis process. Moreover, to improve physical coherence, we develop a Self-similarity Coherence Attention (SCA) module that explicitly integrates local contextual physical coherence into each pixel's generation process. Specifically, we adopt a self-similarity map to encode the coherence restrictions and employ it to extract coherent features from the text embedding. By visualizing our self-similarity map, we explore the essence of SCA and find that its effectiveness lies not only in capturing reliable physical coherence patterns but also in enhancing complex texture generation. Extensive experiments demonstrate the superiority of our proposed method in both image generation quality and controllability.
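To make the idea of a self-similarity map more concrete, the following is a minimal sketch and not the SCA module itself. It assumes cosine similarity between per-pixel feature vectors and a softmax with a hypothetical `temperature` parameter, and it aggregates the pixel features themselves rather than text-embedding features. The function names `self_similarity_map`, `softmax`, and `aggregate` are illustrative and do not come from the paper.

```python
import math


def self_similarity_map(features):
    """Cosine similarity between every pair of feature vectors.

    features: list of N vectors, one per pixel (an H x W map flattened to N).
    Returns an N x N list of lists with values in [-1, 1].
    """
    # Guard against zero vectors by falling back to a norm of 1.0.
    norms = [math.sqrt(sum(x * x for x in f)) or 1.0 for f in features]
    unit = [[x / n for x in f] for f, n in zip(features, norms)]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]


def softmax(row, temperature=1.0):
    """Numerically stable softmax; temperature is an assumed hyperparameter."""
    m = max(row)
    exps = [math.exp((x - m) / temperature) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(features, sim, temperature=0.1):
    """Replace each pixel feature with a similarity-weighted average of all pixels."""
    dim = len(features[0])
    out = []
    for row in sim:
        weights = softmax(row, temperature)
        out.append([sum(w * f[d] for w, f in zip(weights, features))
                    for d in range(dim)])
    return out


# Three toy "pixels": the first two are similar, the third is orthogonal to the first.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
sim = self_similarity_map(features)
enhanced = aggregate(features, sim)
```

In this toy example the first two pixels reinforce each other while the third contributes almost nothing to them, which is the kind of local consistency a self-similarity map is meant to expose.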