Diffusion models have exhibited exceptional performance in text-to-image generation and editing. However, existing methods often struggle with complex text prompts that involve multiple objects, each with multiple attributes and relationships. In this paper, we propose a new training-free text-to-image generation/editing framework, Recaption, Plan and Generate (RPG), which harnesses the chain-of-thought reasoning ability of multimodal LLMs (MLLMs) to enhance the compositionality of text-to-image diffusion models. Our approach employs the MLLM as a global planner that decomposes the generation of a complex image into multiple simpler generation tasks within subregions. We propose complementary regional diffusion to enable region-wise compositional generation. Furthermore, we integrate text-guided image generation and editing within RPG in a closed-loop fashion, thereby enhancing its generalization ability. Extensive experiments demonstrate that RPG outperforms state-of-the-art text-to-image diffusion models, including DALL-E 3 and SDXL, particularly in multi-category object composition and text-image semantic alignment. Notably, the RPG framework is compatible with various MLLM architectures (e.g., MiniGPT-4) and diffusion backbones (e.g., ControlNet). Our code is available at: https://github.com/YangLing0818/RPG-DiffusionMaster
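To make the planning idea concrete, the following is a minimal toy sketch of decomposing one canvas into subregions that tile it without gaps or overlap, then stitching per-region results back together. It is not the authors' implementation: the names `Region`, `plan_regions`, and `stitch`, the column-band layout, the split ratios, and the `generate_region` callback (a stand-in for a per-region diffusion call) are all assumptions introduced for illustration. In RPG the subprompts and layout would come from the MLLM planner, which is not modeled here.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Region:
    """A hypothetical column band [left, right) with its own subprompt."""
    prompt: str
    left: int
    right: int  # exclusive


def plan_regions(subprompts: Sequence[str], ratios: Sequence[float],
                 width: int) -> List[Region]:
    """Split [0, width) into adjacent bands, one per subprompt.

    The bands cover the canvas exactly once, so each column belongs
    to exactly one region.
    """
    if not subprompts or len(subprompts) != len(ratios):
        raise ValueError("need one positive ratio per subprompt")
    if any(r <= 0 for r in ratios):
        raise ValueError("ratios must be positive")

    total = sum(ratios)
    regions: List[Region] = []
    left = 0
    cumulative = 0.0
    for i, (prompt, ratio) in enumerate(zip(subprompts, ratios)):
        cumulative += ratio
        # Pin the last band to the canvas edge to avoid rounding gaps.
        is_last = i == len(subprompts) - 1
        right = width if is_last else round(width * cumulative / total)
        if right <= left:
            raise ValueError("canvas too narrow for the requested split")
        regions.append(Region(prompt, left, right))
        left = right
    return regions


def stitch(regions: Sequence[Region],
           generate_region: Callable[[str, int], list]) -> list:
    """Generate each band independently and concatenate left to right.

    generate_region(prompt, width) is a placeholder for a real
    per-region generation call; it must return `width` columns.
    """
    canvas: list = []
    for region in regions:
        band_width = region.right - region.left
        columns = generate_region(region.prompt, band_width)
        if len(columns) != band_width:
            raise ValueError("generator returned the wrong band width")
        canvas.extend(columns)
    return canvas
```

For example, `plan_regions(["a red cat", "a blue dog"], [1, 2], 12)` yields two bands that split a 12-column canvas in a 1:2 ratio, and `stitch` can then be called with any callback that produces one value per column. A real pipeline would operate on 2D latents and would need some mechanism to keep the regions visually coherent with each other, which this sketch deliberately omits.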