While diffusion models have shown exceptional capabilities in aesthetic image synthesis, they often struggle with complex spatial understanding and reasoning. Existing approaches turn to Multimodal Large Language Models (MLLMs) to compensate for this weakness. However, they either incur high computational costs through joint training or lose spatial information by relying solely on textual prompts. To address these limitations, we propose Spatial Chain-of-Thought (SCoT), a plug-and-play framework that bridges the reasoning capabilities of MLLMs with the generative power of diffusion models. Specifically, we first enhance the diffusion model's layout awareness by training it on an interleaved text-coordinate instruction format. We then use state-of-the-art MLLMs as planners to generate comprehensive layout plans, transferring their spatial planning capabilities directly to the generation process. Extensive experiments demonstrate that our method achieves state-of-the-art performance on image generation benchmarks and significantly outperforms baselines on complex reasoning tasks, while also proving effective in image editing scenarios.
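To make the idea of an interleaved text-coordinate instruction more concrete, the following is a minimal sketch of how a planner's layout could be serialized into a single prompt string. The abstract does not specify the actual format, so everything here is an assumption made for illustration: the `LayoutItem` type, the `to_interleaved_prompt` function, the `<box>` tag syntax, and the quantization of normalized coordinates into 1000 bins are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LayoutItem:
    """One planned object (hypothetical structure, not taken from the paper)."""
    phrase: str                              # object description
    box: Tuple[float, float, float, float]   # (x0, y0, x1, y1), normalized to [0, 1]


def to_interleaved_prompt(caption: str, items: List[LayoutItem], bins: int = 1000) -> str:
    """Serialize a layout plan as text with coordinates following each phrase.

    The <box>...</box> syntax and the bin count are assumptions for illustration.
    """
    parts = [caption.strip()]
    for item in items:
        x0, y0, x1, y1 = item.box
        # Reject boxes that are out of range or have non-positive area.
        if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
            raise ValueError(f"invalid box for {item.phrase!r}: {item.box}")
        # Quantize normalized coordinates to integers so they form short tokens.
        q = [round(v * bins) for v in (x0, y0, x1, y1)]
        parts.append(f"{item.phrase} <box>{q[0]},{q[1]},{q[2]},{q[3]}</box>")
    return " ".join(parts)


if __name__ == "__main__":
    plan = [LayoutItem("a red mug", (0.25, 0.5, 0.5, 0.75))]
    print(to_interleaved_prompt("A desk scene.", plan))
```

In a pipeline of the kind the abstract describes, a string like this would be produced from the MLLM planner's output and passed to the layout-aware diffusion model in place of a plain text prompt. How the planner is prompted and how the model consumes the coordinates are not covered by this sketch.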