Advancements in generative models have sparked significant interest in generating images that adhere to specific structural guidelines. Scene-graph-to-image generation is one such task: generating images that are consistent with a given scene graph. However, the complexity of visual scenes makes it challenging to accurately align objects according to the relations specified in the scene graph. Existing methods approach this task by first predicting a scene layout and then generating images from these layouts using adversarial training. In this work, we introduce a novel approach for generating images from scene graphs that eliminates the need to predict intermediate layouts. We leverage pre-trained text-to-image diffusion models and CLIP guidance to translate graph knowledge into images. To this end, we first pre-train our graph encoder to align graph features with the CLIP features of corresponding images using GAN-based training. We then fuse the graph features with the CLIP embeddings of the object labels present in the given scene graph to create a graph-consistent, CLIP-guided conditioning signal. In this conditioning input, the object embeddings provide the coarse structure of the image, while the graph features provide structural alignment based on the relationships among objects. Finally, we fine-tune a pre-trained diffusion model on the graph-consistent conditioning signal using reconstruction and CLIP alignment losses. Extensive experiments show that our method outperforms existing methods on the standard COCO-Stuff and Visual Genome benchmarks.
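To make the conditioning pipeline concrete, below is a minimal PyTorch sketch of how graph features might be pooled from scene-graph triples and fused with CLIP embeddings of object labels into conditioning tokens for a diffusion model. The module names, dimensions, and message-passing scheme are illustrative assumptions rather than the exact architecture described above; the CLIP label embeddings are assumed to come from a frozen CLIP text encoder.

```python
import torch
import torch.nn as nn


class GraphEncoder(nn.Module):
    """Hypothetical message-passing encoder over (subject, predicate, object) triples."""

    def __init__(self, vocab_size: int, dim: int = 512, layers: int = 3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.mlps = nn.ModuleList(
            nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            for _ in range(layers)
        )

    def forward(self, triples: torch.Tensor) -> torch.Tensor:
        # triples: (num_edges, 3) integer ids for subject, predicate, object
        s, p, o = self.embed(triples).unbind(dim=1)
        for mlp in self.mlps:
            msg = mlp(torch.cat([s, p, o], dim=-1))
            s, o = s + msg, o + msg  # propagate relation messages to node features
        # pool node features into a single graph-level feature
        return torch.cat([s, o], dim=0).mean(dim=0, keepdim=True)  # (1, dim)


class ConditioningFusion(nn.Module):
    """Fuses the pooled graph feature with CLIP embeddings of object labels."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, graph_feat: torch.Tensor, clip_label_emb: torch.Tensor) -> torch.Tensor:
        # graph_feat: (1, dim); clip_label_emb: (num_objects, dim) from a frozen CLIP text encoder
        g = graph_feat.expand(clip_label_emb.size(0), -1)
        # one conditioning token per object, carrying both label semantics and graph structure
        return self.proj(torch.cat([clip_label_emb, g], dim=-1))  # (num_objects, dim)


# Usage sketch: conditioning tokens would replace the text-encoder tokens
# normally fed to the cross-attention layers of a pre-trained diffusion model.
encoder, fusion = GraphEncoder(vocab_size=200), ConditioningFusion()
triples = torch.tensor([[3, 7, 12], [12, 9, 5]])        # e.g. (dog, near, tree), (tree, behind, car)
clip_label_emb = torch.randn(3, 512)                    # placeholder for CLIP label embeddings
cond_tokens = fusion(encoder(triples), clip_label_emb)  # (3, 512)
```

In this sketch the object-label embeddings supply the coarse content of the scene, while the pooled graph feature injects relational structure; the actual model instead fine-tunes the diffusion backbone with reconstruction and CLIP alignment losses on this fused signal.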