Recent advances in large pretrained text-to-image models have shown unprecedented capabilities for high-quality human-centric generation; however, customizing face identity remains an intractable problem. Existing methods cannot ensure both stable identity preservation and flexible editability, even with several images of each subject during training. In this work, we propose StableIdentity, which allows identity-consistent recontextualization with just one face image. More specifically, we employ a face encoder with an identity prior to encode the input face, and then project the face representation into a space with an editability prior, which is constructed from celebrity names. By incorporating the identity prior and the editability prior, the learned identity can be injected into a wide variety of contexts. In addition, we design a masked two-phase diffusion loss to boost the pixel-level perception of the input face and maintain the diversity of generation. Extensive experiments demonstrate that our method outperforms previous customization methods. Moreover, the learned identity can be flexibly combined with off-the-shelf modules such as ControlNet. Notably, to the best of our knowledge, we are the first to directly inject the identity learned from a single image into video/3D generation without finetuning. We believe that the proposed StableIdentity is an important step toward unifying image, video, and 3D customized generation models.
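The abstract names a masked two-phase diffusion loss but does not spell out its form. The following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes that (a) a binary face mask restricts the loss to face pixels, (b) the two phases are separated by a timestep threshold, and (c) the noisier phase uses the usual noise-prediction error while the less noisy phase compares a reconstructed clean sample against the input. The function name `masked_two_phase_loss`, the parameter `t_split`, and the flat-list data layout are all hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence


def masked_two_phase_loss(
    noise: Sequence[float],
    noise_pred: Sequence[float],
    x0: Sequence[float],
    x_t: Sequence[float],
    mask: Sequence[float],
    alpha_bar_t: float,
    t: int,
    t_split: int,
) -> float:
    """Masked mean squared error with a timestep-dependent target.

    All sequences are flattened images of equal length. `mask` holds 1.0 for
    face pixels and 0.0 elsewhere. `alpha_bar_t` is the cumulative noise
    schedule value at step `t`, as in a standard DDPM forward process:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise.
    `t_split` is a hypothetical threshold separating the two phases.
    """
    if not 0.0 < alpha_bar_t <= 1.0:
        raise ValueError("alpha_bar_t must lie in (0, 1]")

    n_masked = sum(mask)
    if n_masked == 0:
        return 0.0  # nothing inside the mask, so there is nothing to penalize

    if t >= t_split:
        # Assumed noisier phase: plain noise-prediction error.
        errors = [e - p for e, p in zip(noise, noise_pred)]
    else:
        # Assumed less noisy phase: invert the forward process to estimate the
        # clean sample from the predicted noise, then compare it with x0.
        root = math.sqrt(alpha_bar_t)
        scale = math.sqrt(1.0 - alpha_bar_t)
        errors = [
            clean - (noisy - scale * p) / root
            for clean, noisy, p in zip(x0, x_t, noise_pred)
        ]

    # Average the squared error over masked (face) pixels only.
    return sum(m * err * err for m, err in zip(mask, errors)) / n_masked
```

Under these assumptions, errors outside the face mask contribute nothing to the loss, and switching the target to the reconstructed clean sample at low noise levels is one way a loss could emphasize pixel-level fidelity of the face. Whether the paper splits the phases this way, or uses a different weighting, is not stated in the abstract.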