Despite the remarkable performance of text-to-image diffusion models in image generation tasks, recent studies have shown that generated images sometimes fail to capture the intended semantic content of the text prompt, a phenomenon often called semantic misalignment. To address this, we present a novel energy-based model (EBM) framework. Specifically, we first formulate EBMs of latent image representations and text embeddings in each cross-attention layer of the denoising autoencoder. We then obtain the gradient of the log posterior of the context vectors and use it to update them; the updated context vectors are transferred to the subsequent cross-attention layer, thereby implicitly minimizing a nested hierarchy of energy functions. Our latent EBMs further allow zero-shot compositional generation as a linear combination of cross-attention outputs from different contexts. Through extensive experiments, we demonstrate that the proposed method is highly effective across various image generation tasks, including multi-concept generation, text-guided image inpainting, and real and synthetic image editing.
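The compositional step described above, a linear combination of cross-attention outputs from different contexts, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the projection matrices `W_k` and `W_v`, the mixing `weights`, and the use of plain single-head scaled dot-product attention are all assumptions made for illustration, and the energy-based context update itself is not shown.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(Q, K, V):
    # Standard single-head scaled dot-product attention.
    # Q: (n_pixels, d) image queries; K, V: (n_tokens, d) from a text context.
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def composed_cross_attention(Q, contexts, W_k, W_v, weights):
    # Hypothetical helper: run cross-attention once per context embedding,
    # then mix the per-context outputs with the given scalar weights.
    outputs = [cross_attention(Q, c @ W_k, c @ W_v) for c in contexts]
    return sum(w * o for w, o in zip(weights, outputs))
```

In this sketch each context (for example, the embedding of a separate prompt) may have a different number of tokens, since every context produces an output of the same shape as `Q` before mixing. How the weights are chosen, and at which layers the mixing is applied, is left open here.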