Classifier-Free Guidance (CFG) is widely used in text-to-image diffusion models, where a single CFG scale controls the strength of text guidance over the entire image. However, we argue that a global CFG scale amplifies different semantic regions unevenly, which leads to spatially inconsistent guidance strength and suboptimal image quality. To address this problem, we present Semantic-aware Classifier-Free Guidance (S-CFG), an approach that customizes the guidance degree for each semantic unit in text-to-image diffusion models. Specifically, we first design a training-free semantic segmentation method that partitions the latent image into relatively independent semantic regions at each denoising step. The cross-attention map in the denoising U-Net backbone is renormalized to assign each patch to its corresponding token, while the self-attention map is used to complete the semantic regions. Then, to balance the amplification of the different semantic units, we adaptively adjust the CFG scale in each semantic region so that the text guidance degrees are rescaled to a uniform level. Finally, extensive experiments demonstrate the superiority of S-CFG over the original CFG strategy on various text-to-image diffusion models, without requiring any extra training cost. Our code is available at https://github.com/SmilesDZgk/S-CFG.
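To make the two steps concrete, the following is a minimal NumPy sketch, not the released implementation. The function names, array shapes, the use of a matrix product to propagate cross-attention scores through self-attention, and the choice of the image-wide mean guidance magnitude as the common reference level are all assumptions made for illustration; the exact formulation in the paper and repository may differ.

```python
import numpy as np


def segment_from_attention(cross_attn, self_attn):
    """Assign each patch to one token (hypothetical helper).

    cross_attn: (N, T) attention from N patches to T tokens
    self_attn:  (N, N) attention among patches
    Returns an integer label per patch, shape (N,).
    """
    # Assumption: spread token scores to similar patches via self-attention.
    refined = self_attn @ cross_attn
    # Renormalize each token's map over patches so tokens are comparable.
    refined = refined / (refined.sum(axis=0, keepdims=True) + 1e-8)
    return refined.argmax(axis=1)


def region_wise_cfg(eps_uncond, eps_cond, masks, base_scale=7.5, tiny=1e-8):
    """Combine noise predictions with one guidance scale per region.

    eps_uncond, eps_cond: (C, H, W) unconditional / conditional predictions
    masks: (K, H, W) boolean masks that partition the H x W grid
    Returns the guided prediction (C, H, W) and the scale map (H, W).
    """
    delta = eps_cond - eps_uncond              # text guidance term
    strength = np.linalg.norm(delta, axis=0)   # per-position magnitude, (H, W)
    target = strength.mean()                   # assumed common reference level
    scale_map = np.full(strength.shape, base_scale, dtype=float)
    for mask in masks:
        if mask.any():
            region_strength = strength[mask].mean()
            # Weak regions get a larger scale, strong regions a smaller one.
            scale_map[mask] = base_scale * target / (region_strength + tiny)
    guided = eps_uncond + scale_map[None] * delta
    return guided, scale_map
```

With standard CFG the scale map would be the constant `base_scale` everywhere; here it varies by region so that the mean amplified guidance magnitude is the same in every region.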