While recent developments in text-to-image generative models have produced a suite of high-performing methods capable of generating creative imagery from free-form text, these methods still have several limitations. By analyzing the cross-attention representations of these models, we identify two key issues. First, for text prompts that contain multiple concepts, there is a significant amount of pixel-space overlap (i.e., the same spatial regions are attended to) among pairs of different concepts. As a result, the model cannot distinguish between the two concepts, and one of them is ignored in the final generation. Second, while these models attempt to capture all such concepts at the beginning of denoising (e.g., the first few steps), as evidenced by cross-attention maps, this knowledge is not retained by the end of denoising (e.g., the last few steps). Such loss of knowledge eventually leads to inaccurate generation outputs. To address these issues, our key innovations include two test-time attention-based loss functions that substantially improve the performance of pretrained baseline text-to-image diffusion models. First, our attention segregation loss reduces the cross-attention overlap between attention maps of different concepts in the text prompt, thereby reducing the confusion/conflict among concepts and enabling the eventual capture of all concepts in the generated output. Second, our attention retention loss explicitly forces text-to-image diffusion models to retain cross-attention information for all concepts across all denoising time steps, thereby reducing information loss and preserving all concepts in the generated output.
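The abstract names the two losses but gives no formulas, so the following is only a minimal sketch of what such losses could look like, not the authors' definitions. It assumes each concept has a non-negative cross-attention map of shape (H, W), measures overlap as a soft intersection over total attention mass, and measures retention as a soft IoU against a binary mask taken from an earlier denoising step. The function names `segregation_loss` and `retention_loss`, the mask construction, and the epsilon constant are all hypothetical.

```python
import numpy as np

EPS = 1e-8  # assumed small constant to avoid division by zero


def segregation_loss(attn_maps):
    """Sum of pairwise overlaps between concept attention maps (assumed form).

    attn_maps: array of shape (num_concepts, H, W), non-negative entries.
    Returns 0 when no two maps share a spatial location.
    """
    num_concepts = attn_maps.shape[0]
    total = 0.0
    for m in range(num_concepts):
        for n in range(m + 1, num_concepts):
            # Soft intersection: attention both concepts place on the same pixels.
            intersection = np.minimum(attn_maps[m], attn_maps[n]).sum()
            # Normalize by the combined attention mass of the pair.
            mass = (attn_maps[m] + attn_maps[n]).sum()
            total += intersection / (mass + EPS)
    return float(total)


def retention_loss(attn_maps, reference_masks):
    """Mean of (1 - soft IoU) between current maps and earlier masks (assumed form).

    attn_maps:       (num_concepts, H, W) maps at the current denoising step.
    reference_masks: (num_concepts, H, W) binary masks from an earlier step.
    Returns 0 when every map still covers exactly its earlier region.
    """
    losses = []
    for current, mask in zip(attn_maps, reference_masks):
        intersection = np.minimum(current, mask).sum()
        union = np.maximum(current, mask).sum()
        losses.append(1.0 - intersection / (union + EPS))
    return float(np.mean(losses))


if __name__ == "__main__":
    # Two toy concepts on a 4x4 grid: one attends to the top half, one to the bottom.
    maps = np.zeros((2, 4, 4))
    maps[0, :2, :] = 1.0
    maps[1, 2:, :] = 1.0
    masks = (maps > 0.5).astype(float)  # hypothetical thresholding rule
    print("segregation:", segregation_loss(maps))
    print("retention:", retention_loss(maps, masks))
```

In this toy setup the two maps are disjoint and unchanged from their reference masks, so both losses are close to zero; fully overlapping maps would push the segregation term toward 0.5 per pair, and a map that drifts entirely off its earlier region would push the retention term toward 1. How such losses are applied at test time (for example, as a gradient signal on the latent) is not stated in the abstract and is left out here.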