Text-to-image (T2I) diffusion models have shown significant success in personalized text-to-image generation, which aims to generate novel images of the human identities shown in reference images. Although several tuning-free methods achieve promising identity fidelity, they usually suffer from overfitting: the learned identity tends to become entangled with irrelevant information, resulting in unsatisfactory text controllability, especially on faces. In this work, we present MasterWeaver, a test-time tuning-free method designed to generate personalized images with both high identity fidelity and flexible editability. Specifically, MasterWeaver adopts an encoder to extract identity features and steers image generation through additionally introduced cross-attention. To improve editability while maintaining identity fidelity, we propose an editing direction loss for training, which aligns the editing directions of MasterWeaver with those of the original T2I model. Additionally, we construct a face-augmented dataset to facilitate disentangled identity learning and further improve editability. Extensive experiments demonstrate that MasterWeaver not only generates personalized images with faithful identity but also offers superior text controllability. Our code will be publicly available at https://github.com/csyxwei/MasterWeaver.
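The abstract does not give the exact form of the editing direction loss, so the following is only an illustrative sketch of the general idea under stated assumptions. It assumes that an "editing direction" can be represented as the difference between a model's features for an edited prompt and for a source prompt, and that alignment is measured with cosine distance. The function names (`direction`, `cosine`, `editing_direction_loss`) and the use of plain feature vectors are hypothetical and are not taken from the MasterWeaver code.

```python
import math


def direction(source, edited):
    """Editing direction: element-wise difference (edited - source)."""
    return [e - s for s, e in zip(source, edited)]


def cosine(a, b, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def editing_direction_loss(pers_source, pers_edited, base_source, base_edited):
    """Assumed loss form: 1 - cos(personalized direction, base direction).

    pers_* are features from the personalized model, base_* are features
    from the original T2I model, each for a source and an edited prompt.
    The loss is near 0 when both models move in the same direction.
    """
    d_pers = direction(pers_source, pers_edited)
    d_base = direction(base_source, base_edited)
    return 1.0 - cosine(d_pers, d_base)


# Toy features: the personalized model starts from a different point
# but moves along the same axis as the base model.
aligned = editing_direction_loss([0.5, 0.5], [2.5, 0.5], [0.0, 0.0], [1.0, 0.0])
# Here the personalized model moves in the opposite direction.
opposed = editing_direction_loss([0.5, 0.5], [-0.5, 0.5], [0.0, 0.0], [1.0, 0.0])
```

In this toy setup, `aligned` is close to zero because the two directions are parallel even though the starting features differ, while `opposed` is close to the maximum value of two. This is the property a direction-alignment objective of this kind relies on: it constrains how the output changes under a text edit without forcing the personalized features to match the original model's features.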