Despite rapid progress in text-to-image generation, synthesizing and manipulating multiple entities under specific relational constraints remains challenging. This paper introduces a progressive synthesis and editing procedure that incorporates entities into the target image one step at a time, enforcing spatial and relational constraints at each step. Our key insight stems from the observation that a pre-trained text-to-image diffusion model handles one or two entities well but often fails with more. To address this limitation, we harness a Large Language Model (LLM) to decompose long, complex text descriptions into coherent directives that follow a strict format. To execute directives involving distinct semantic operations (namely insertion, editing, and erasing), we formulate the Stimulus, Response, and Fusion (SRF) framework: latent regions are gently stimulated according to each operation, and the responsive latent components are then fused to achieve cohesive entity manipulation. Our framework yields notable improvements in object synthesis, particularly on complex and lengthy textual inputs, setting a new standard for text-to-image generation.
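The core fusion step described above, blending a "responsive" latent into the current latent only inside a stimulated region, can be sketched as follows. This is a minimal illustration of the idea, not the authors' implementation: the function names (`fuse_latents`, `apply_directive`) and the directive dictionary layout are hypothetical, and the `response` array stands in for the latent a diffusion step would produce when conditioned on a directive's local prompt.

```python
import numpy as np

def fuse_latents(base, response, mask, strength=0.8):
    """Blend a responsive latent into the base latent inside the
    stimulated region (mask == 1), leaving the rest untouched.
    Hypothetical sketch of the Stimulus-Response-Fusion idea."""
    weight = mask * strength
    return base * (1.0 - weight) + response * weight

def apply_directive(latent, directive):
    """Execute one parsed directive (insert / edit / erase) on a latent.
    `directive["response"]` is assumed to come from a diffusion step
    conditioned on the directive's local prompt."""
    mask = directive["mask"]
    if directive["op"] == "erase":
        # Stimulate the region toward noise to remove the entity there.
        response = np.random.default_rng(0).standard_normal(latent.shape)
    else:  # "insert" or "edit"
        response = directive["response"]
    return fuse_latents(latent, response, mask)
```

Applying a sequence of such directives, one entity at a time, mirrors the progressive synthesis loop: each step only perturbs the masked region, so entities placed earlier remain intact.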