Existing text-to-image models struggle to follow complex text prompts, which motivates extra grounding inputs for better controllability. In this work, we propose to decompose a scene into visual primitives, denoted as dense blob representations, that contain fine-grained details of the scene while being modular, human-interpretable, and easy to construct. Based on blob representations, we develop a blob-grounded text-to-image diffusion model, termed BlobGEN, for compositional generation. In particular, we introduce a new masked cross-attention module to disentangle the fusion between blob representations and visual features. To leverage the compositionality of large language models (LLMs), we introduce a new in-context learning approach to generate blob representations from text prompts. Our extensive experiments show that BlobGEN achieves superior zero-shot generation quality and better layout-guided controllability on MS-COCO. When augmented by LLMs, our method exhibits superior numerical and spatial correctness on compositional image generation benchmarks. Project page: https://blobgen-2d.github.io.
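To make the idea of masked cross-attention more concrete, the following is a minimal NumPy sketch, not the BlobGEN implementation. It assumes that each blob comes with a binary spatial mask over the flattened feature map, that visual features act as queries, and that blob embeddings act as both keys and values with no learned projections. The function name `masked_cross_attention` and all shapes are hypothetical choices for illustration.

```python
import numpy as np


def masked_cross_attention(visual, blobs, masks):
    """Illustrative masked cross-attention (assumed form, not the paper's code).

    visual: (N, d) flattened visual features, used as queries.
    blobs:  (K, d) blob embeddings, used as keys and values.
    masks:  (K, N) boolean; masks[k, n] is True if blob k covers position n.
    Returns an (N, d) array; positions covered by no blob receive zeros.
    """
    d = visual.shape[1]
    allowed = masks.T                                  # (N, K)
    scores = visual @ blobs.T / np.sqrt(d)             # (N, K) scaled dot products
    scores = np.where(allowed, scores, -np.inf)        # block blobs outside their region

    # Numerically stable softmax restricted to the allowed blobs.
    covered = allowed.any(axis=1, keepdims=True)       # (N, 1)
    row_max = np.max(scores, axis=1, keepdims=True)
    row_max = np.where(covered, row_max, 0.0)          # avoid (-inf) - (-inf)
    weights = np.exp(scores - row_max)                 # masked entries become 0
    denom = weights.sum(axis=1, keepdims=True)
    weights = np.divide(weights, denom, out=np.zeros_like(weights), where=denom > 0)

    return weights @ blobs                             # (N, d)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    visual = rng.normal(size=(4, 2))
    blobs = rng.normal(size=(2, 2))
    # Blob 0 covers positions 0 and 1; blob 1 covers positions 1 and 2.
    masks = np.array([[True, True, False, False],
                      [False, True, True, False]])
    print(masked_cross_attention(visual, blobs, masks))
```

In this sketch, a position covered by exactly one blob receives that blob's embedding unchanged, a position covered by several blobs receives a softmax-weighted mix of them, and an uncovered position receives zeros. This locality is one plausible reading of how masking can keep each blob's influence confined to its own region.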