Recent advances in diffusion models have substantially improved text-to-image generation, yet global text prompts alone remain insufficient for fine-grained control over individual entities within an image. To address this limitation, we present EliGen, a novel framework for Entity-level controlled image Generation. First, we introduce regional attention, a mechanism for diffusion transformers that requires no additional parameters and seamlessly integrates entity prompts with arbitrary-shaped spatial masks. We then contribute a high-quality dataset with fine-grained spatial and semantic entity-level annotations and use it to train EliGen, which achieves robust and accurate entity-level manipulation, surpassing existing methods in both spatial precision and image quality. We further propose an inpainting fusion pipeline that extends EliGen to multi-entity image inpainting tasks. Finally, we demonstrate its flexibility by integrating it with other open-source models such as IP-Adapter, In-Context LoRA, and MLLMs, unlocking new creative possibilities. The source code, model, and dataset are published at https://github.com/modelscope/DiffSynth-Studio.git.
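As a rough illustration of the regional attention idea, the sketch below shows one way an attention mask can couple entity prompts with spatial masks in a diffusion transformer, assuming a joint token sequence of [global prompt | entity prompts | image tokens]. The helper name, token layout, and tensor shapes are illustrative assumptions, not EliGen's exact implementation.

```python
# Minimal sketch: build a boolean attention mask (True = attention allowed)
# so each entity's prompt tokens interact only with image tokens inside that
# entity's spatial mask, while global prompt tokens attend everywhere.
# This adds no parameters; it only restricts existing attention.
import torch
import torch.nn.functional as F

def build_regional_attention_mask(n_global, entity_token_counts, entity_pixel_masks):
    # entity_pixel_masks: (num_entities, n_image) bool, one flattened
    # entity mask per entity, downsampled to the latent resolution.
    n_image = entity_pixel_masks.shape[1]
    n_entity = sum(entity_token_counts)
    n = n_global + n_entity + n_image
    mask = torch.zeros(n, n, dtype=torch.bool)

    # Global prompt tokens attend to, and are attended by, everything.
    mask[:n_global, :] = True
    mask[:, :n_global] = True

    # Image self-attention stays unrestricted.
    img0 = n_global + n_entity
    mask[img0:, img0:] = True

    offset = n_global
    for count, pix in zip(entity_token_counts, entity_pixel_masks):
        sl = slice(offset, offset + count)
        mask[sl, sl] = True                 # entity prompt self-attention
        mask[sl, img0:] = pix               # entity text -> its image region
        mask[img0:, sl] = pix.unsqueeze(1)  # its image region -> entity text
        offset += count
    return mask

# Usage inside an attention layer, with q, k, v of shape (batch, heads, n, dim):
# out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```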