Text-to-image diffusion models have an unprecedented ability to generate diverse and high-quality images. However, they often struggle to faithfully capture the intended semantics of complex input prompts that include multiple subjects. Recently, numerous layout-to-image extensions have been introduced to improve user control, aiming to localize subjects represented by specific tokens. Yet, these methods often produce semantically inaccurate images, especially when dealing with multiple semantically or visually similar subjects. In this work, we study and analyze the causes of these limitations. Our exploration reveals that the primary issue stems from inadvertent semantic leakage between subjects in the denoising process. This leakage is attributed to the diffusion model's attention layers, which tend to blend the visual features of different subjects. To address these issues, we introduce Bounded Attention, a training-free method for bounding the information flow in the sampling process. Bounded Attention prevents detrimental leakage among subjects and enables guiding the generation to promote each subject's individuality, even with complex multi-subject conditioning. Through extensive experimentation, we demonstrate that our method empowers the generation of multiple subjects that better align with given prompts and layouts.
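The core idea of bounding the information flow can be illustrated with a small sketch: mask the self-attention over image tokens so that queries inside one subject's layout region cannot attend to keys inside any other subject's region, which prevents the feature blending described above. This is a minimal, hypothetical illustration of attention masking in NumPy, not the authors' implementation; the function name and arguments are invented for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bounded_self_attention(q, k, v, subject_masks):
    """Illustrative bounded self-attention over flattened image tokens.

    q, k, v: (n_tokens, d) arrays of queries, keys, values.
    subject_masks: list of boolean (n_tokens,) arrays, one per subject,
        marking which tokens fall inside that subject's layout region.

    Queries in one subject's region are blocked from attending to keys in
    any *other* subject's region; attention within a region, and to or from
    background tokens, is left untouched.
    """
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)

    # Block cross-subject query->key pairs.
    block = np.zeros((n, n), dtype=bool)
    for i, mi in enumerate(subject_masks):
        for j, mj in enumerate(subject_masks):
            if i != j:
                block |= np.outer(mi, mj)
    scores[block] = -np.inf  # softmax assigns these pairs zero weight

    return softmax(scores, axis=-1) @ v
```

Because the blocked entries receive zero attention weight, a subject's output features are, by construction, independent of the values stored at the other subjects' tokens.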