The success of recent text-to-image diffusion models is largely due to their capacity to be guided by complex text prompts, which enable users to describe the desired content precisely. However, these models struggle to suppress undesired content, even when the prompt explicitly asks for it to be omitted from the generated image. In this paper, we analyze how to manipulate the text embeddings to remove unwanted content from them. We introduce two contributions, which we refer to as $\textit{soft-weighted regularization}$ and $\textit{inference-time text embedding optimization}$. The first regularizes the text embedding matrix and effectively suppresses the undesired content. The second further suppresses generation of the unwanted content named in the prompt and encourages generation of the desired content. We evaluate our method quantitatively and qualitatively in extensive experiments, validating its effectiveness. Furthermore, our method generalizes to both pixel-space diffusion models (e.g., DeepFloyd-IF) and latent-space diffusion models (e.g., Stable Diffusion).
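The abstract does not spell out how the text embedding matrix is regularized. As a rough illustration only, the sketch below assumes an SVD-based scheme in which each singular value $\sigma$ is rescaled to $e^{-\sigma}\sigma$, so that the dominant directions (assumed here to carry the unwanted content) shrink the most. The function name, the matrix layout, and the weighting rule are all assumptions made for this example, not details taken from the text above.

```python
import numpy as np


def soft_weight_singular_values(embeddings: np.ndarray) -> np.ndarray:
    """Hypothetical soft weighting of a (dim, n_tokens) embedding matrix.

    Assumption: the columns hold text-token embeddings tied to the content
    that should be suppressed, and that content dominates the top singular
    directions of the matrix.
    """
    # Thin SVD: embeddings = u @ diag(s) @ vt
    u, s, vt = np.linalg.svd(embeddings, full_matrices=False)
    # Assumed weighting rule: sigma -> exp(-sigma) * sigma.
    # Large singular values are damped strongly, small ones barely change.
    s_soft = np.exp(-s) * s
    # Rebuild the matrix with the re-weighted spectrum (u * s_soft scales columns of u).
    return (u * s_soft) @ vt


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-in for a block of token embeddings (8-dim, 4 tokens).
    toy = rng.normal(size=(8, 4)) * 5.0
    regularized = soft_weight_singular_values(toy)
    print("largest singular value before:", np.linalg.svd(toy, compute_uv=False).max())
    print("largest singular value after: ", np.linalg.svd(regularized, compute_uv=False).max())
```

Because $e^{-\sigma}\sigma \le 1/e$ for every $\sigma \ge 0$, this particular rule caps the spectral norm of the output. In a real pipeline the function would be applied only to the embedding columns associated with the unwanted content, and which columns those are is something the abstract leaves open.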