Diffusion models have achieved remarkable advances in text-to-image generation. However, existing models still struggle with multiple-object compositional generation. In this paper, we propose RealCompo, a new training-free and transfer-friendly text-to-image generation framework that leverages the advantages of text-to-image and layout-to-image models to enhance both the realism and the compositionality of the generated images. We propose an intuitive and novel balancer that dynamically balances the strengths of the two models during the denoising process, allowing plug-and-play use of any model without extra training. Extensive experiments show that RealCompo consistently outperforms state-of-the-art text-to-image and layout-to-image models in multiple-object compositional generation while keeping satisfactory realism and compositionality of the generated images. Code is available at https://github.com/YangLing0818/RealCompo
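To make the idea of balancing two models during denoising concrete, the following is a minimal sketch of one possible reading, not the method as implemented in the RealCompo repository. It assumes that each model produces a noise prediction at a given denoising step and that the balancer holds one scalar coefficient per model, normalized with a softmax. The names `balance_noise`, `coef_t2i`, and `coef_l2i` are hypothetical, and the abstract does not say how the coefficients are updated from step to step, so that part is left out.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def balance_noise(eps_t2i, eps_l2i, coef_t2i, coef_l2i):
    """Blend two noise predictions with softmax-normalized weights.

    Hypothetical helper: the scalar-coefficient parameterization is an
    assumption made for illustration, not taken from the abstract.
    """
    if len(eps_t2i) != len(eps_l2i):
        raise ValueError("noise predictions must have the same length")
    w_t2i, w_l2i = softmax([coef_t2i, coef_l2i])
    return [w_t2i * a + w_l2i * b for a, b in zip(eps_t2i, eps_l2i)]


# Toy values standing in for flattened noise predictions at one denoising step.
eps_text = [0.2, -0.4, 1.0]      # from the text-to-image model (assumed)
eps_layout = [0.6, 0.0, -1.0]    # from the layout-to-image model (assumed)

# Equal coefficients give an even blend of the two predictions.
blended = balance_noise(eps_text, eps_layout, coef_t2i=0.0, coef_l2i=0.0)
```

Under this sketch, raising one coefficient relative to the other shifts the blended prediction toward that model, which is one way a balancer could trade realism against compositionality without retraining either model.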