Despite the remarkable performance of text-to-image diffusion models in image generation, recent studies have shown that generated images sometimes fail to capture the intended semantic content of the text prompt, a phenomenon often called semantic misalignment. To address this, we present a novel energy-based model (EBM) framework. Specifically, we first formulate EBMs of latent image representations and text embeddings in each cross-attention layer of the denoising autoencoder. We then obtain the gradient of the log posterior of the context vectors, so that the context vectors can be updated and transferred to the subsequent cross-attention layer, thereby implicitly minimizing a nested hierarchy of energy functions. Our latent EBMs further allow zero-shot compositional generation as a linear combination of cross-attention outputs from different contexts. Through extensive experiments, we demonstrate that the proposed method is highly effective in handling various image generation tasks, including multi-concept generation, text-guided image inpainting, and real and synthetic image editing.
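To make the compositional step concrete, the following is a minimal sketch of a linear combination of cross-attention outputs computed from different contexts. It is an illustration under stated assumptions, not the authors' implementation: the single-head attention, the projection matrices `w_k` and `w_v`, the tensor shapes, and the helper names `cross_attention` and `compose` are all hypothetical, and the Bayesian context update described in the abstract is not modeled here.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_k, w_v):
    """Single-head cross-attention (assumed form, for illustration only).

    queries: (n_pixels, d)     latent image representations
    context: (n_tokens, d_ctx) text embeddings for one prompt
    w_k, w_v: (d_ctx, d)       hypothetical key/value projections
    """
    keys = context @ w_k
    values = context @ w_v
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ values


def compose(queries, contexts, weights, w_k, w_v):
    """Linearly combine the cross-attention outputs of several contexts.

    The weights are assumed to be user-chosen scalars, one per context.
    """
    outputs = [cross_attention(queries, c, w_k, w_v) for c in contexts]
    return sum(w * o for w, o in zip(weights, outputs))


# Toy usage with random data; all sizes are arbitrary.
rng = np.random.default_rng(0)
queries = rng.normal(size=(16, 8))     # 16 latent positions, width 8
ctx_a = rng.normal(size=(5, 6))        # prompt A: 5 tokens
ctx_b = rng.normal(size=(7, 6))        # prompt B: 7 tokens
w_k = rng.normal(size=(6, 8))
w_v = rng.normal(size=(6, 8))

mixed = compose(queries, [ctx_a, ctx_b], [0.7, 0.3], w_k, w_v)
```

Because the combination happens on the attention outputs, the contexts may have different token counts, and setting one weight to zero recovers ordinary cross-attention on the remaining context.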