Multi-instance image generation (MIG) remains a significant challenge for modern diffusion models, which struggle to achieve precise control over object layout while preserving the identities of multiple distinct subjects. To address these limitations, we introduce ContextGen, a novel Diffusion Transformer framework for multi-instance generation that is guided by both layout and reference images. Our approach integrates two key technical contributions: a Contextual Layout Anchoring (CLA) mechanism that incorporates the composite layout image into the generation context to robustly anchor objects in their desired positions, and Identity Consistency Attention (ICA), an innovative attention mechanism that leverages contextual reference images to ensure identity consistency across multiple instances. To address the absence of a large-scale, high-quality dataset for this task, we introduce IMIG-100K, the first dataset to provide detailed layout and identity annotations specifically designed for multi-instance generation. Extensive experiments demonstrate that ContextGen sets a new state of the art, outperforming existing methods particularly in layout control and identity fidelity.
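The abstract does not specify ICA's exact formulation. As an illustrative sketch only (not the paper's actual implementation), one plausible reading is an attention mask over a concatenated token sequence of generated-image tokens, layout tokens, and per-instance reference tokens: image and layout tokens may attend to everything, while each instance's reference tokens attend only within their own identity group, keeping the identities from blending. All token counts and the `build_ica_mask` helper below are hypothetical.

```python
import numpy as np

def build_ica_mask(n_img: int, n_layout: int, ref_lens: list[int]) -> np.ndarray:
    """Boolean attention mask over [image | layout | ref_1 | ... | ref_k] tokens.

    Hypothetical sketch of an identity-consistency mask: True means
    "query token (row) may attend to key token (column)".
    """
    n_total = n_img + n_layout + sum(ref_lens)
    mask = np.zeros((n_total, n_total), dtype=bool)

    # Image and layout tokens attend to the full context,
    # so generation can pull identity cues from every reference.
    mask[: n_img + n_layout, :] = True

    # Each instance's reference tokens attend only within their own
    # identity group, preventing cross-instance identity leakage.
    offset = n_img + n_layout
    for length in ref_lens:
        mask[offset : offset + length, offset : offset + length] = True
        offset += length
    return mask

# Toy example: 4 image tokens, 2 layout tokens, two references of 3 and 2 tokens.
mask = build_ica_mask(4, 2, [3, 2])
```

Under this sketch, reference token 6 (first identity) can attend to tokens 6-8 but not to tokens 9-10 (second identity), while image token 0 attends everywhere.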