Text-to-image (T2I) diffusion models have shown significant success in personalized text-to-image generation, which aims to generate novel images of human identities indicated by reference images. Although several tuning-free methods have achieved promising identity fidelity, they usually suffer from overfitting: the learned identity tends to entangle with irrelevant information, resulting in unsatisfactory text controllability, especially on faces. In this work, we present MasterWeaver, a test-time tuning-free method designed to generate personalized images with both faithful identity and flexible editability. Specifically, MasterWeaver adopts an encoder to extract identity features and steers image generation through additionally introduced cross-attention. To improve editability while maintaining identity fidelity, we propose an editing direction loss for training, which aligns the editing directions of MasterWeaver with those of the original T2I model. Additionally, we construct a face-augmented dataset to facilitate disentangled identity learning and further improve editability. Extensive experiments demonstrate that MasterWeaver not only generates personalized images with faithful identity but also exhibits superior text controllability. Our code will be publicly available at https://github.com/csyxwei/MasterWeaver.
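To make the injection mechanism concrete, below is a minimal single-head PyTorch sketch, assuming the encoder's identity features are fed through an extra cross-attention branch added residually alongside the U-Net's text cross-attention; the names `IdentityCrossAttention`, `id_feats`, and `id_dim` are illustrative assumptions, not the paper's actual implementation.

```python
import torch


class IdentityCrossAttention(torch.nn.Module):
    """Hypothetical sketch: extra cross-attention over identity features.

    Queries come from the U-Net's spatial features; keys/values come from
    identity features extracted by the reference-image encoder. The output
    is added residually, so it steers generation alongside (not instead of)
    the original text cross-attention.
    """

    def __init__(self, dim: int, id_dim: int):
        super().__init__()
        self.to_q = torch.nn.Linear(dim, dim, bias=False)
        self.to_k = torch.nn.Linear(id_dim, dim, bias=False)
        self.to_v = torch.nn.Linear(id_dim, dim, bias=False)

    def forward(self, x: torch.Tensor, id_feats: torch.Tensor) -> torch.Tensor:
        # x: (B, L, dim) U-Net tokens; id_feats: (B, M, id_dim) identity tokens
        q, k, v = self.to_q(x), self.to_k(id_feats), self.to_v(id_feats)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return x + attn @ v  # residual injection of identity information
```

Only the new branch's projections need training in such a design, which keeps the frozen T2I backbone intact and makes the method tuning-free at test time.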
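The editing direction loss can likewise be sketched. Assuming the editing direction is measured as the difference between denoiser outputs for an edited prompt versus the source prompt (the paper's exact feature choice may differ), a hypothetical PyTorch loss aligning the personalized model's direction with that of the frozen original T2I model might look like:

```python
import torch
import torch.nn.functional as F


def editing_direction_loss(
    eps_orig_src: torch.Tensor,   # frozen T2I prediction, source prompt
    eps_orig_edit: torch.Tensor,  # frozen T2I prediction, edited prompt
    eps_pers_src: torch.Tensor,   # personalized prediction, source prompt
    eps_pers_edit: torch.Tensor,  # personalized prediction, edited prompt
) -> torch.Tensor:
    # Editing direction = (edited prediction) - (source prediction)
    dir_orig = (eps_orig_edit - eps_orig_src).flatten(1)
    dir_pers = (eps_pers_edit - eps_pers_src).flatten(1)
    # 1 - cosine similarity: zero when the personalized model edits
    # in the same direction as the original T2I model
    return (1.0 - F.cosine_similarity(dir_pers, dir_orig, dim=1)).mean()
```

Penalizing misalignment this way discourages the identity features from hijacking text-driven edits, which is the stated goal of improving editability while preserving identity fidelity.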