MUSE: Multi-Subject Unified Synthesis via Explicit Layout Semantic Expansion

Existing text-to-image diffusion models have demonstrated remarkable capabilities in generating high-quality images guided by textual prompts. However, achieving multi-subject compositional synthesis with precise spatial control remains a significant challenge. In this work, we address the task of layout-controllable multi-subject synthesis (LMS), which requires both faithful reconstruction of reference subjects and their accurate placement in specified regions within a unified image. While recent advancements have separately improved layout control and subject synthesis, existing approaches struggle to simultaneously satisfy the dual requirements of spatial precision and identity preservation in this composite task. To bridge this gap, we propose MUSE, a unified synthesis framework that employs concatenated cross-attention (CCA) to seamlessly integrate layout specifications with textual guidance through explicit semantic space expansion. The proposed CCA mechanism enables bidirectional modality alignment between spatial constraints and textual descriptions without interference. Furthermore, we design a progressive two-stage training strategy that decomposes the LMS task into learnable sub-objectives for effective optimization. Extensive experiments demonstrate that MUSE achieves zero-shot end-to-end generation with superior spatial accuracy and identity consistency compared to existing solutions, advancing the frontier of controllable image synthesis. Our code and model are available at https://github.com/pf0607/MUSE.
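The abstract does not spell out how CCA is built, so the following is only a minimal sketch of one plausible reading: layout tokens are appended to the text tokens along the sequence axis, and image tokens attend over the combined context in a single softmax. Everything here is an illustrative assumption rather than the authors' implementation: the function names, the omission of learned query/key/value projections, the shared embedding width, and the toy inputs are all hypothetical. The sketch is kept dependency-free for readability; the released code at the repository above is the authoritative reference.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over all keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


def concatenated_cross_attention(image_tokens, text_tokens, layout_tokens):
    """Attend from image tokens to the concatenation [text; layout].

    Assumptions (not from the paper): learned Q/K/V projections are
    omitted, keys double as values, and layout tokens already share
    the embedding width of the text tokens.
    """
    # Concatenating the two lists extends the context along the sequence axis.
    context = text_tokens + layout_tokens
    dim = len(context[0])
    outputs = []
    for query in image_tokens:
        weights = attention_weights(query, context)
        # Each output is a weighted average of the context tokens.
        outputs.append(
            [sum(w * tok[j] for w, tok in zip(weights, context)) for j in range(dim)]
        )
    return outputs


# Hypothetical toy inputs: 2 image tokens, 2 text tokens, 1 layout token, width 2.
image_tokens = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[1.0, 0.0], [0.5, 0.5]]
layout_tokens = [[0.0, 1.0]]

fused = concatenated_cross_attention(image_tokens, text_tokens, layout_tokens)
text_only = concatenated_cross_attention(image_tokens, text_tokens, [])
```

In this toy setup the second image token is most similar to the layout token, so its fused output shifts toward that token relative to the text-only result. This illustrates how appending layout tokens enlarges the context that each image token can draw on, while the text tokens remain in the sequence unchanged.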
