CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation

Diffusion models are recognized for generating images that are both visually appealing and of high artistic quality. Layout-to-Image (L2I) generation builds on them by using region-specific positions and descriptions to make generation more precise and controllable. However, previous methods focus primarily on UNet-based models (e.g., SD1.5 and SDXL), and little work has explored Multimodal Diffusion Transformers (MM-DiTs), which have demonstrated powerful image generation capabilities. Enabling MM-DiT for layout-to-image generation seems straightforward but is challenging because of the complexity of how layout is introduced, integrated, and balanced among multiple modalities. To this end, we explore various network variants for efficiently incorporating layout guidance into MM-DiT, and ultimately present SiamLayout. To inherit the advantages of MM-DiT, we use a separate set of network weights to process the layout, treating it as equally important as the image and text modalities. Meanwhile, to alleviate competition among modalities, we decouple the image-layout interaction into a siamese branch alongside the image-text one and fuse the two branches at a later stage. Moreover, we contribute a large-scale layout dataset, named LayoutSAM, which includes 2.7 million image-text pairs and 10.7 million entities. Each entity is annotated with a bounding box and a detailed description. We further construct the LayoutSAM-Eval benchmark as a comprehensive tool for evaluating L2I generation quality. Finally, we introduce the Layout Designer, which taps into the potential of large language models in layout planning, transforming them into experts in layout generation and optimization. Our code, model, and dataset will be available at https://creatilayout.github.io.
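The siamese design described above can be illustrated with a minimal sketch. It is not the SiamLayout implementation: the function names (`joint_attention`, `siamese_fuse`), the `layout_weight` parameter, and the weighted-sum fusion rule are assumptions made for illustration, and the learned per-modality projections (the separate network weights) are omitted so that the example stays dependency-free. The point it shows is that image tokens attend to text tokens and to layout tokens in two separate branches, so the two conditioning modalities do not compete inside one attention map, and the image outputs of the two branches are merged afterwards.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention over lists of token vectors."""
    scale = 1.0 / math.sqrt(len(q[0]))
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, v)


def joint_attention(image, other):
    """Self-attention over image tokens concatenated with one other modality.

    Only the outputs at the image-token positions are returned.
    Assumption: identity projections stand in for learned Q/K/V weights.
    """
    tokens = image + other
    out = attention(tokens, tokens, tokens)
    return out[: len(image)]


def siamese_fuse(image, text, layout, layout_weight=1.0):
    """Run image-text and image-layout branches separately, then fuse.

    Assumption: fusion is a weighted sum of the two image outputs.
    """
    from_text = joint_attention(image, text)
    from_layout = joint_attention(image, layout)
    return [
        [a + layout_weight * b for a, b in zip(row_t, row_l)]
        for row_t, row_l in zip(from_text, from_layout)
    ]


# Toy tokens: 2 image tokens, 1 text token, 3 layout tokens, all 2-dimensional.
image = [[1.0, 0.0], [0.0, 1.0]]
text = [[0.5, 0.5]]
layout = [[1.0, 1.0], [0.0, 0.0], [0.2, 0.3]]
fused = siamese_fuse(image, text, layout)
```

Setting `layout_weight=0.0` recovers the image-text branch alone, which makes the decoupling explicit: layout guidance enters only through the second branch.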
