Recent advances in diffusion models have demonstrated an impressive capacity to generate visually striking images. Nevertheless, ensuring that the generated image closely matches the given prompt remains a persistent challenge. In this work, we identify inadequate cross-modality relation learning between the prompt and the output image as a crucial factor behind this text-image mismatch. To better align the prompt with the image content, we augment cross-attention with an adaptive mask, conditioned on the attention maps and the prompt embeddings, that dynamically adjusts the contribution of each text token to the image features. This mechanism explicitly reduces the ambiguity in the semantic information embedded by the text encoder, improving text-to-image consistency in the synthesized images. Our method, termed MaskDiffusion, is training-free and hot-pluggable for popular pre-trained diffusion models. When applied to latent diffusion models, MaskDiffusion significantly improves text-to-image consistency with negligible computational overhead compared to the original diffusion models.
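To make the idea of re-weighting each text token's contribution inside cross-attention concrete, the following is a minimal NumPy sketch. It is not the MaskDiffusion implementation: the abstract does not specify how the adaptive mask is computed, so `toy_adaptive_mask` is a hypothetical placeholder rule that looks only at the attention maps (the actual method also conditions on the prompt embeddings), and the function names, shapes, `boost` parameter, and renormalization step are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_cross_attention(q, k, v, token_mask):
    """Cross-attention with a per-token multiplicative mask.

    q: (n_pixels, d) image-feature queries
    k: (n_tokens, d) text-token keys
    v: (n_tokens, d_v) text-token values
    token_mask: (n_tokens,) positive weights, one per text token
    """
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)   # (n_pixels, n_tokens)
    attn = attn * token_mask                        # re-weight each token
    attn = attn / attn.sum(axis=-1, keepdims=True)  # assumption: renormalize rows
    return attn @ v, attn

def toy_adaptive_mask(attn, boost=2.0):
    # Hypothetical rule (NOT from the paper): boost tokens whose average
    # attention over all pixels is below the mean, leave the rest unchanged.
    mean_per_token = attn.mean(axis=0)
    return np.where(mean_per_token < mean_per_token.mean(), boost, 1.0)

# Toy usage with random features: 16 image positions, 5 text tokens.
rng = np.random.default_rng(0)
q = rng.normal(size=(16, 8))
k = rng.normal(size=(5, 8))
v = rng.normal(size=(5, 4))

# Pass 1: plain cross-attention (an all-ones mask changes nothing).
out_plain, attn_plain = masked_cross_attention(q, k, v, np.ones(5))
# Pass 2: derive a mask from the attention maps and re-apply attention.
mask = toy_adaptive_mask(attn_plain)
out_masked, attn_masked = masked_cross_attention(q, k, v, mask)
```

Because the mask acts only on the attention weights at inference time, a scheme of this shape needs no retraining, which is consistent with the training-free, hot-pluggable property claimed above.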