While text-to-image diffusion models can generate high-quality images from textual descriptions, they generally lack fine-grained control over the visual composition of the generated images. Some recent works tackle this problem by training the model to condition the generation process on additional input describing the desired image layout. ControlNet, arguably the most popular of these methods, enables a high degree of control over the generated image using various types of conditioning inputs (e.g., segmentation maps). However, it still cannot take into account localized textual descriptions that indicate which image region is described by which phrase in the prompt. In this work, we show the limitations of ControlNet for the layout-to-image task and enable it to use localized descriptions through a training-free approach that modifies the cross-attention scores during generation. We adapt and investigate several existing cross-attention control methods in the context of ControlNet and identify shortcomings that cause failure (concept bleeding) or image degradation under specific conditions. To address these shortcomings, we develop a novel cross-attention manipulation method that maintains image quality while improving control. Qualitative and quantitative experiments focusing on challenging cases demonstrate the effectiveness of the general approach and show the improvements obtained with the proposed cross-attention control method.
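To make the idea of modifying cross-attention scores with localized descriptions concrete, the following is a minimal sketch of one generic form of such control: hard masking, where each image position may attend only to the prompt tokens assigned to its region. It is an illustration under stated assumptions, not the method proposed in this work; the function name `masked_cross_attention`, the `allowed` mask layout, and the toy vectors are all hypothetical.

```python
import math

def masked_cross_attention(queries, keys, values, allowed):
    """Cross-attention from image positions to prompt tokens with a region mask.

    queries: list of P vectors (one per image position)
    keys, values: lists of T vectors (one per prompt token)
    allowed[p][t]: True if position p may attend to token t (hypothetical layout)
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs, all_weights = [], []
    for q, row_allowed in zip(queries, allowed):
        if not any(row_allowed):
            raise ValueError("each position needs at least one allowed token")
        # Scaled dot-product scores; disallowed tokens are set to -inf
        # so that they receive exactly zero weight after the softmax.
        scores = [
            scale * sum(qi * ki for qi, ki in zip(q, k)) if ok else -math.inf
            for k, ok in zip(keys, row_allowed)
        ]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the token value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy setup (assumed): 4 image positions, 3 prompt tokens.
# Token 0 is shared by all positions; token 1 describes region A
# (positions 0-1) and token 2 describes region B (positions 2-3).
queries = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [-1.0, 1.0]]
keys = [[0.1, 0.1], [1.0, 0.0], [0.0, 1.0]]
values = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
allowed = [[True, True, False]] * 2 + [[True, False, True]] * 2

outputs, weights = masked_cross_attention(queries, keys, values, allowed)
```

In this sketch, positions in region A receive no contribution from the token describing region B and vice versa, which is the basic mechanism for preventing concept bleeding. Hard masking of this kind is only one possible form of cross-attention control, and the abstract notes that existing control methods can degrade image quality under specific conditions.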