Multi-Instance Generation has advanced significantly in spatial placement and attribute binding. However, existing approaches still face challenges in fine-grained semantic understanding, particularly when dealing with complex textual descriptions. To overcome these limitations, we propose DEIG, a novel framework for fine-grained and controllable multi-instance generation. DEIG integrates an Instance Detail Extractor (IDE) that transforms text encoder embeddings into compact, instance-aware representations, and a Detail Fusion Module (DFM) that applies instance-based masked attention to prevent attribute leakage across instances. These components enable DEIG to generate visually coherent multi-instance scenes that precisely match rich, localized textual descriptions. To support fine-grained supervision, we construct a high-quality dataset with detailed, compositional instance captions generated by VLMs. We also introduce DEIG-Bench, a new benchmark with region-level annotations and multi-attribute prompts for both humans and objects. Experiments demonstrate that DEIG consistently outperforms existing approaches across multiple benchmarks in spatial consistency, semantic accuracy, and compositional generalization. Moreover, DEIG functions as a plug-and-play module, making it easily integrable into standard diffusion-based pipelines.
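The instance-based masked attention used by the DFM can be illustrated with a minimal sketch: each image patch attends only to the text tokens of the instance whose spatial mask covers it, which blocks attribute leakage across instances. This is an illustrative reconstruction, not the paper's implementation; all function and variable names here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def instance_masked_attention(patch_feats, inst_texts, inst_masks):
    """Sketch of instance-based masked cross-attention (hypothetical API).

    patch_feats: (P, d) image patch queries
    inst_texts:  (N, T, d) per-instance text tokens (used as keys and values)
    inst_masks:  (N, P) binary masks, 1 if patch p belongs to instance n
    """
    P, d = patch_feats.shape
    N, T, _ = inst_texts.shape
    keys = inst_texts.reshape(N * T, d)
    logits = patch_feats @ keys.T / np.sqrt(d)     # (P, N*T)
    # Block attention from a patch to tokens of instances that do not own it,
    # so one instance's attributes cannot leak into another's region.
    allow = np.repeat(inst_masks.T, T, axis=1)     # (P, N*T)
    logits = np.where(allow > 0, logits, -1e9)
    attn = softmax(logits, axis=-1)
    return attn @ keys                             # (P, d)
```

Under this masking, editing one instance's caption tokens leaves the features of patches owned exclusively by other instances unchanged, which is the leakage-prevention property the DFM targets.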