Diffusion models excel at image generation but offer limited fine-grained semantic control through text prompts. Additional techniques have been developed to address this limitation. However, conditioning diffusion models solely on text descriptions remains challenging because text is ambiguous and lacks structure. In contrast, scene graphs represent image content more precisely, making them better suited to fine-grained control and accurate synthesis in image generation models. Image and scene-graph data is scarce, which makes fine-tuning large diffusion models challenging. We propose multiple approaches to tackle this problem using ControlNet and Gated Self-Attention. We show that our proposed methods can generate images from scene graphs with much higher quality, outperforming previous methods. Our source code is publicly available at https://github.com/FrankFundel/SGCond