We consider the problem of customizing text-to-image diffusion models with user-supplied reference images. Given new prompts, existing methods can capture the key concept from the reference images but fail to align the generated image with the prompt. In this work, we seek to address this key issue by proposing new methods that can easily be used in conjunction with existing customization methods, which optimize the embeddings/weights at various intermediate stages of the text encoding process. The first contribution of this paper is a dissection of the various stages of the text encoding process leading up to the conditioning vector for text-to-image models. We take a holistic view of existing customization methods and notice that the key and value outputs from this process differ substantially from those of the corresponding baseline (non-customized) models (e.g., baseline Stable Diffusion). While this difference does not impact the concept being customized, it causes other parts of the generated image to be misaligned with the prompt. Further, we observe that these keys and values allow independent control of various aspects of the final generation, enabling semantic manipulation of the output. Taken together, the features spanning these keys and values serve as the basis for our next contribution, where we fix the aforementioned issues with existing methods. We propose a new post-processing algorithm, AlignIT, that infuses the keys and values for the concept of interest while ensuring the keys and values for all other tokens in the input prompt remain unchanged. Our proposed method can be plugged directly into existing customization methods, leading to a substantial improvement in the alignment of the final result with the input prompt while retaining the customization quality.
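The core infusion step can be illustrated with a minimal sketch. Assuming we have per-token key/value matrices from the customized text-encoding pipeline and from the baseline model, AlignIT's idea amounts to keeping the baseline keys and values for all tokens except those belonging to the concept of interest. The function name, array shapes, and index convention below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def infuse_key_values(k_base, v_base, k_cust, v_cust, concept_idx):
    """Hypothetical sketch of key/value infusion.

    k_base, v_base : (num_tokens, dim) keys/values from the baseline model.
    k_cust, v_cust : (num_tokens, dim) keys/values from the customized model.
    concept_idx    : indices of the tokens representing the customized concept.

    Returns keys/values that use the customized rows only at the concept
    token positions and the baseline rows everywhere else.
    """
    k = k_base.copy()
    v = v_base.copy()
    k[concept_idx] = k_cust[concept_idx]  # inject concept keys
    v[concept_idx] = v_cust[concept_idx]  # inject concept values
    return k, v
```

Keeping the non-concept rows identical to the baseline is what preserves prompt alignment, while the injected concept rows carry the customization.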