Enhancing Object Coherence in Layout-to-Image Synthesis

Layout-to-image synthesis is an emerging technique in conditional image generation. It aims to generate complex scenes in which users require fine control over the layout of the objects. However, controlling object coherence remains challenging, including semantic coherence (e.g., whether the cat looks at the flowers) and physical coherence (e.g., the hand and the racket should not be misaligned). In this paper, we propose a novel diffusion model with effective global semantic fusion (GSF) and self-similarity feature enhancement modules to guide object coherence for this task. For semantic coherence, we argue that the image caption contains rich information for defining the semantic relationships among the objects in an image. Simply employing cross-attention between captions and generated images addresses the closely related layout restriction and semantic coherence separately, and thus leads to the unsatisfying results shown in our experiments. Instead, we develop GSF to fuse the supervision from the layout restriction and the semantic coherence requirement, and exploit it to guide the image synthesis process. Moreover, to improve physical coherence, we develop a Self-similarity Coherence Attention (SCA) module to explicitly integrate local contextual physical coherence into each pixel's generation process. Specifically, we adopt a self-similarity map to encode the coherence restrictions and employ it to extract coherent features from the text embedding. Through visualization of our self-similarity map, we explore the essence of SCA, revealing that its effectiveness lies not only in capturing reliable physical coherence patterns but also in enhancing complex texture generation. Extensive experiments demonstrate the superiority of our proposed method in both image generation quality and controllability.
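The abstract does not specify how the self-similarity map is computed, so the following is only a generic sketch of the underlying idea, not the SCA module itself. It assumes a feature map of shape (H, W, C), cosine similarity between every pair of spatial positions, and a softmax-weighted aggregation; the function names, the `temperature` parameter, and the choice of cosine similarity are all illustrative assumptions.

```python
import numpy as np


def self_similarity_map(features: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Cosine similarity between all pairs of spatial positions.

    features: array of shape (H, W, C).
    Returns an (H*W, H*W) matrix; entry (i, j) compares position i with j.
    NOTE: cosine similarity is an assumption made for this illustration.
    """
    h, w, c = features.shape
    flat = features.reshape(h * w, c)
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    unit = flat / np.maximum(norms, eps)  # guard against zero vectors
    return unit @ unit.T


def similarity_weighted_features(features: np.ndarray,
                                 temperature: float = 0.1) -> np.ndarray:
    """Aggregate features at each position from positions similar to it.

    Hypothetical helper: turns each row of the self-similarity map into
    softmax weights and uses them to average the feature vectors.
    """
    h, w, c = features.shape
    logits = self_similarity_map(features) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return (weights @ features.reshape(h * w, c)).reshape(h, w, c)
```

In this sketch each position is most strongly influenced by positions whose features resemble its own, which is one plausible way a similarity map could encode local coherence constraints.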

