Diffusion models have demonstrated the ability to synthesize high-quality, diverse images from textual prompts. However, simultaneous control over both global contexts (e.g., object layouts and interactions) and local details (e.g., colors and emotions) remains a significant challenge. These models often fail to understand complex descriptions involving multiple objects, and they either apply specified visual attributes to the wrong targets or ignore them. This paper presents Global-Local Diffusion (\textit{GLoD}), a novel framework that allows simultaneous control over global contexts and local details in text-to-image generation without requiring training or fine-tuning. It assigns multiple global and local prompts to corresponding layers and composes their noises to guide the denoising process, using pre-trained diffusion models. Our framework enables complex global-local compositions, conditioning objects in the global prompt on the local prompts while preserving other unspecified identities. Our quantitative and qualitative evaluations demonstrate that GLoD effectively generates complex images that adhere to both user-provided object interactions and object details.