Layout-to-image synthesis is an emerging technique in conditional image generation. It aims to generate complex scenes in which users require fine-grained control over the layout of the objects. However, controlling object coherence remains challenging, including semantic coherence (e.g., whether or not the cat looks at the flowers) and physical coherence (e.g., the hand and the racket should not be misaligned). In this paper, we propose a novel diffusion model with effective global semantic fusion (GSF) and self-similarity feature enhancement modules to guide object coherence for this task. For semantic coherence, we argue that the image caption contains rich information for defining the semantic relationships among the objects in an image. Simply applying cross-attention between captions and generated images addresses the closely related layout restriction and semantic coherence separately, and thus leads to unsatisfying results, as shown in our experiments. Instead, we develop GSF to fuse the supervision from the layout restriction and the semantic coherence requirement and exploit it to guide the image synthesis process. Moreover, to improve physical coherence, we develop a Self-similarity Coherence Attention (SCA) module to explicitly integrate local contextual physical coherence into each pixel's generation process. Specifically, we adopt a self-similarity map to encode the coherence restrictions and employ it to extract coherent features from the text embedding. Through visualization of our self-similarity map, we explore the essence of SCA, revealing that its effectiveness lies not only in capturing reliable physical coherence patterns but also in enhancing complex texture generation. Extensive experiments demonstrate the superiority of our proposed method in both image generation quality and controllability.
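The abstract does not spell out how the self-similarity map is computed or applied, so the following is only an illustrative sketch under stated assumptions, not the paper's implementation. It assumes the map is a pairwise cosine similarity over spatial positions, that text features are first gathered per position with ordinary scaled dot-product cross-attention, and that the map then re-weights those features so that similar positions receive similar text-derived features. The function names, the `temperature` parameter, and the shared channel dimension are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_similarity_map(features):
    """Cosine similarity between every pair of spatial positions.

    features: (N, C) array, one C-dimensional feature per position.
    Returns an (N, N) symmetric matrix with entries in [-1, 1].
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-8)  # avoid division by zero
    return unit @ unit.T


def coherence_attention(features, text_emb, temperature=0.1):
    """Hypothetical self-similarity-guided attention (an assumption).

    features: (N, C) image features; text_emb: (T, C) text token embeddings.
    Both are assumed to share the channel dimension C.
    Returns (N, C) text-derived features smoothed across similar positions.
    """
    # Step 1: ordinary cross-attention from positions to text tokens.
    scores = features @ text_emb.T / np.sqrt(features.shape[1])  # (N, T)
    per_position_text = softmax(scores, axis=-1) @ text_emb      # (N, C)

    # Step 2: turn the self-similarity map into row-normalized weights.
    weights = softmax(self_similarity_map(features) / temperature, axis=-1)

    # Step 3: aggregate text features over self-similar positions.
    return weights @ per_position_text


rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8))   # e.g. a flattened 4x4 feature map
tokens = rng.normal(size=(5, 8))   # e.g. 5 caption tokens
out = coherence_attention(feats, tokens)
```

A lower `temperature` makes each position rely mostly on itself and its closest matches, while a higher value spreads the aggregation more evenly across positions. A version closer to the "local contextual" wording in the abstract would presumably restrict the map to a spatial neighborhood, which this sketch omits.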