This paper presents Prototype-based Self-Distillation (ProS), a novel approach to unsupervised face representation learning. Existing supervised methods rely heavily on large amounts of annotated facial training data, which raises challenges in data collection and privacy. To address these issues, ProS leverages a vast collection of unlabeled face images to learn a comprehensive facial omni-representation. ProS consists of two vision transformers, a teacher and a student, that are trained on differently augmented images (cropping, blurring, coloring, etc.). In addition, we build a face-aware retrieval system that, together with the augmentations, yields curated images consisting predominantly of facial regions. To make the learned features more discriminative, we introduce a prototype-based matching loss that aligns the similarity distributions between features (teacher or student) and a set of learnable prototypes. After pre-training, the teacher vision transformer serves as the backbone for downstream tasks, including attribute estimation, expression recognition, and landmark alignment, each handled by simple fine-tuning with additional layers. Extensive experiments demonstrate that our method achieves state-of-the-art performance on these tasks in both full-data and few-shot settings. We also investigate pre-training on synthetic face images, where ProS likewise shows promising performance.
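
To make the prototype-based matching idea concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not give the loss formula, so this sketch assumes a DINO-style form: each feature is compared to every prototype by cosine similarity, the similarities are turned into a distribution with a temperature-scaled softmax, and the student distribution is pulled toward the teacher distribution with a cross-entropy. The function names and the temperature values are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def similarity_distribution(feature, prototypes, temperature):
    """Softmax over cosine similarities between one feature and each prototype."""
    f = l2_normalize(feature)
    logits = [
        sum(a * b for a, b in zip(f, l2_normalize(p))) / temperature
        for p in prototypes
    ]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prototype_matching_loss(student_feat, teacher_feat, prototypes,
                            t_student=0.1, t_teacher=0.04):
    """Cross-entropy between teacher and student similarity distributions.

    Assumption: the temperatures are illustrative defaults, not values
    taken from the paper.
    """
    p_teacher = similarity_distribution(teacher_feat, prototypes, t_teacher)
    p_student = similarity_distribution(student_feat, prototypes, t_student)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))


if __name__ == "__main__":
    # Toy 2-D example with two prototypes (stand-ins for learnable parameters).
    prototypes = [[1.0, 0.0], [0.0, 1.0]]
    teacher = [1.0, 0.0]
    print(prototype_matching_loss([0.9, 0.1], teacher, prototypes))
    print(prototype_matching_loss([0.0, 1.0], teacher, prototypes))
```

In this toy setup, a student feature that points in nearly the same direction as the teacher feature receives a lower loss than one aligned with a different prototype. In actual training, the features would come from the two vision transformers and the prototypes would be learned jointly with the student.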