This paper presents Arc2Face, an identity-conditioned face foundation model which, given the ArcFace embedding of a person, can generate diverse photo-realistic images with a higher degree of face similarity than existing models. Despite previous attempts to decode face recognition features into detailed images, we find that common high-resolution datasets (e.g., FFHQ) lack sufficient identities to reconstruct any subject. To that end, we meticulously upsample a significant portion of the WebFace42M database, the largest public dataset for face recognition (FR). Arc2Face builds upon a pretrained Stable Diffusion model, yet adapts it to the task of ID-to-face generation, conditioned solely on ID vectors. Deviating from recent works that combine ID with text embeddings for zero-shot personalization of text-to-image models, we emphasize the compactness of FR features, which can fully capture the essence of the human face, as opposed to hand-crafted prompts. Crucially, text-augmented models struggle to decouple identity and text, usually necessitating some description of the given face to achieve satisfactory similarity. Arc2Face, however, needs only the discriminative features of ArcFace to guide the generation, offering a robust prior for a plethora of tasks where ID consistency is of paramount importance. As an example, we train an FR model on synthetic images from our model and achieve performance superior to that of models trained on existing synthetic datasets.
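The abstract does not say how face similarity is scored. A common choice for ArcFace-style embeddings is the cosine similarity between the embedding of the reference face and that of a generated image. The following is a minimal sketch of that comparison, not the authors' evaluation code: the 512-dimensional size is an assumption, the helper name `id_similarity` is hypothetical, and random vectors stand in for real embeddings.

```python
import numpy as np


def id_similarity(a, b):
    """Cosine similarity between two identity embeddings (hypothetical helper)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Random stand-ins for real embeddings (assumption: 512-dimensional vectors).
rng = np.random.default_rng(0)
source = rng.standard_normal(512)                     # reference identity
generated = source + 0.1 * rng.standard_normal(512)   # image of the same identity
unrelated = rng.standard_normal(512)                  # a different identity

print("same identity:     ", id_similarity(source, generated))
print("different identity:", id_similarity(source, unrelated))
```

In this toy setup the first score is close to 1 and the second is close to 0, which is the sense in which an ID-conditioned generator would be said to preserve identity.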