Recently, integrating visual controls into text-to-image~(T2I) models, as in ControlNet, has received significant attention because it enables finer-grained control. While various training-free methods aim to improve prompt following in T2I models, prompt following under visual control remains rarely studied, especially when the visual controls are misaligned with the text prompts. In this paper, we address the challenge of ``Prompt Following With Visual Control'' and propose a training-free approach named Mask-guided Prompt Following (MGPF). Object masks are introduced to distinguish the aligned and misaligned parts of visual controls and prompts. Meanwhile, a network, dubbed Masked ControlNet, is designed to use these object masks for object generation in the misaligned visual control regions. Further, to improve attribute matching, a simple yet efficient loss is designed to align the attention maps of attributes with the object regions constrained by ControlNet and the object masks. The efficacy and superiority of MGPF are validated through comprehensive quantitative and qualitative experiments.