Recent advances in text-to-image models have enabled high-quality personalized image synthesis of user-provided concepts with flexible textual control. In this work, we analyze the limitations of two primary techniques in text-to-image personalization: Textual Inversion and DreamBooth. When integrating the learned concept into new prompts, Textual Inversion tends to overfit the concept, while DreamBooth often overlooks it. We attribute these issues to the incorrect learning of the embedding alignment for the concept. We introduce AttnDreamBooth, a novel approach that addresses these issues by separately learning the embedding alignment, the attention map, and the subject identity in different training stages. We also introduce a cross-attention map regularization term to enhance the learning of the attention map. Our method demonstrates significant improvements in identity preservation and text alignment compared to the baseline methods.
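The staged training and the attention regularizer described above can be illustrated with a minimal sketch. The abstract does not give the form of the regularization term or the parameters updated in each stage, so everything below is an assumption made for illustration. The parameter-group names, the `STAGES` schedule, and the choice to match the mean and variance of the concept token's attention map against a reference map (for example, that of a coarse class word) are all hypothetical. This is not the authors' implementation.

```python
from statistics import fmean, pvariance

# Hypothetical three-stage schedule. The stage names follow the abstract;
# the parameter-group names are illustrative assumptions.
STAGES = [
    {"name": "embedding_alignment", "trainable": {"concept_embedding"}},
    {"name": "attention_map", "trainable": {"concept_embedding", "cross_attention"}},
    {"name": "subject_identity", "trainable": {"concept_embedding", "unet"}},
]


def trainable_groups(stage_index):
    """Return the (assumed) set of parameter groups updated in a given stage."""
    return STAGES[stage_index]["trainable"]


def attention_map_regularizer(concept_map, reference_map,
                              mean_weight=1.0, var_weight=1.0):
    """Assumed regularizer: squared gaps in mean and variance between two
    flattened cross-attention maps (lists of floats)."""
    mean_gap = fmean(concept_map) - fmean(reference_map)
    var_gap = pvariance(concept_map) - pvariance(reference_map)
    return mean_weight * mean_gap ** 2 + var_weight * var_gap ** 2


if __name__ == "__main__":
    for index, stage in enumerate(STAGES):
        print(stage["name"], sorted(trainable_groups(index)))
    # Identical maps incur no penalty under this assumed form.
    print(attention_map_regularizer([0.1, 0.2, 0.3, 0.4], [0.1, 0.2, 0.3, 0.4]))
```

In this sketch the penalty is zero when the two maps share the same first and second moments, so it constrains how the new token attends without forcing the maps to be identical pixel by pixel.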