Advanced diffusion-based Text-to-Image (T2I) models, such as Stable Diffusion, have made significant progress in generating diverse, high-quality images from text prompts alone. However, when ordinary, non-celebrity users require personalized image generation for their identities (IDs), T2I models fail to accurately generate their ID-related images. The main problem is that pre-trained T2I models have not learned the mapping between new ID prompts and their corresponding visual content. Previous methods either fail to accurately fit the face region or lose the interactive generative ability with other concepts already present in the T2I model. In other words, they are unable to generate text-aligned, semantic-fidelity images for prompts that combine the ID with other concepts such as scenes (``Eiffel Tower''), actions (``holding a basketball''), and facial attributes (``eyes closed''). In this paper, we focus on inserting an accurate and interactive ID embedding into Stable Diffusion for semantic-fidelity personalized generation. We address this challenge from two perspectives: face-wise region fitting and semantic-fidelity token optimization. Specifically, we first visualize the attention-overfit problem and propose a face-wise attention loss that fits the face region instead of entangling ID-unrelated information such as face layout and background. This significantly improves ID accuracy and the interactive generative ability with other existing concepts. We then optimize one ID representation as multiple per-stage tokens, each containing two disentangled features; this expansion of the textual conditioning space improves semantic-fidelity control. Extensive experiments validate that our results exhibit superior ID accuracy, text-based manipulation ability, and generalization compared to previous methods.
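To make the face-wise region-fitting idea concrete, the following is a minimal illustrative sketch (not the paper's exact formulation): it assumes PyTorch, a per-sample cross-attention map for the ID token at one U-Net attention resolution, and a binary face mask resized to that resolution; the function name, loss form, and weighting hyperparameter are all assumptions made for illustration.

```python
import torch


def face_wise_attention_loss(attn_map: torch.Tensor,
                             face_mask: torch.Tensor) -> torch.Tensor:
    """Illustrative sketch: constrain the ID token's cross-attention to the face region.

    attn_map:  (B, H, W) cross-attention weights of the ID token,
               averaged over heads, at one attention resolution.
    face_mask: (B, H, W) binary mask, 1 inside the face region, 0 elsewhere
               (e.g., from an off-the-shelf face parser, resized to H x W).
    """
    # Normalize the attention map so it sums to 1 per sample.
    attn = attn_map / (attn_map.flatten(1).sum(dim=1, keepdim=True).view(-1, 1, 1) + 1e-8)

    # Penalize attention mass falling outside the face mask, so the ID token
    # does not entangle ID-unrelated content such as face layout and background.
    outside_mass = (attn * (1.0 - face_mask)).flatten(1).sum(dim=1)
    return outside_mass.mean()


# Hypothetical usage inside a fine-tuning step:
# loss = diffusion_loss + lambda_face * face_wise_attention_loss(id_attn, mask)
```

Under this sketch, the loss is minimized when all of the ID token's attention mass lies inside the face region, which is one way to discourage the embedding from absorbing ID-unrelated scene information.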