This work asks: given abundant unlabeled real faces, how can we learn a robust and transferable facial representation that improves the generalization of diverse face security tasks? We make the first attempt and propose FSFM, a self-supervised pretraining framework that learns fundamental representations of real face images by leveraging the synergy between masked image modeling (MIM) and instance discrimination (ID). We explore various facial masking strategies for MIM and present a simple yet powerful CRFR-P masking, which explicitly forces the model to capture meaningful intra-region consistency and challenging inter-region coherency. Furthermore, we devise an ID network that naturally couples with MIM to establish the underlying local-to-global correspondence via tailored self-distillation. These three learning objectives, dubbed 3C, empower the model to encode both local features and global semantics of real faces. After pretraining, a vanilla ViT serves as a universal vision foundation model for downstream face security tasks: cross-dataset deepfake detection, cross-domain face anti-spoofing, and detection of unseen diffusion-based facial forgeries. Extensive experiments on 10 public datasets demonstrate that our model transfers better than supervised pretraining and prior visual and facial self-supervised learning arts, and even outperforms task-specialized SOTA methods.
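To make the masking idea concrete, below is a minimal, hedged sketch of a region-aware masking in the spirit of CRFR-P: fully cover one randomly chosen facial region, then distribute the remaining mask budget proportionally over the other regions. The region partition, function name, and exact budgeting here are illustrative assumptions, not the paper's precise procedure.

```python
import random

def crfr_p_mask(region_patches, mask_ratio=0.75):
    """Sketch of CRFR-P-style masking (assumed behavior, not the official code).

    region_patches: dict mapping a facial-region name to its list of patch indices.
    Returns the set of masked patch indices.
    """
    total = sum(len(p) for p in region_patches.values())
    budget = int(total * mask_ratio)

    # Step 1: completely cover one randomly chosen facial region,
    # forcing the model to infer it from inter-region coherency.
    covered = random.choice(list(region_patches))
    masked = set(region_patches[covered])
    budget -= len(masked)

    # Step 2: spend the remaining budget proportionally across the
    # other regions, so intra-region consistency is still exercised.
    rest = {r: p for r, p in region_patches.items() if r != covered}
    rest_total = sum(len(p) for p in rest.values())
    for r, patches in rest.items():
        k = min(len(patches), round(budget * len(patches) / rest_total))
        masked.update(random.sample(patches, k))
    return masked

# Toy 14x14=196-patch face split into hypothetical regions.
regions = {"eyes": list(range(0, 40)), "nose": list(range(40, 70)),
           "mouth": list(range(70, 100)), "skin": list(range(100, 196))}
mask = crfr_p_mask(regions)
```

In this sketch, one region is always fully hidden while the rest are partially masked, which mirrors the abstract's intra-region consistency and inter-region coherency objectives.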