The increasing realism and accessibility of deepfakes have raised critical concerns about media authenticity and information integrity. Despite recent advances, deepfake detection models often struggle to generalize beyond their training distributions, particularly when applied to media content found in the wild. In this work, we present a robust video deepfake detection framework with strong generalization that leverages the rich facial representations learned by face foundation models. Our method builds on FSFM, a self-supervised model trained on real face data, and is further fine-tuned on an ensemble of deepfake datasets spanning both face-swapping and face-reenactment manipulations. To enhance discriminative power, we incorporate triplet loss variants during training, guiding the model to produce more separable embeddings for real and fake samples. Additionally, we explore attribution-based supervision schemes, in which deepfakes are categorized by manipulation type or source dataset, to assess their impact on generalization. Extensive experiments across diverse evaluation benchmarks demonstrate the effectiveness of our approach, especially in challenging real-world scenarios.
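The abstract does not specify which triplet loss variants are used, but they all build on the standard triplet margin loss, which penalizes an anchor embedding that is not at least `margin` closer to a same-class (positive) sample than to an opposite-class (negative) sample. The sketch below is a minimal pure-Python illustration under that assumption; the embeddings and margin value are hypothetical, not outputs of FSFM or parameters from the paper.

```python
import math

def euclidean(u, v):
    # Euclidean distance between two embedding vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Standard triplet margin loss: pull the anchor toward the positive
    # (same authenticity class) and push it away from the negative
    # (opposite class) by at least `margin`.
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Illustrative 3-D embeddings (hypothetical, not real FSFM features).
real_anchor = [0.0, 0.0, 0.0]   # anchor: a real-face embedding
real_pos    = [0.1, 0.0, 0.0]   # positive: another real face
fake_neg    = [1.0, 1.0, 1.0]   # negative: a deepfake embedding

# Well-separated triplet: the constraint is satisfied, so the loss is zero.
print(triplet_loss(real_anchor, real_pos, fake_neg))   # 0.0

# Violating triplet: the fake lies closer to the anchor than the real
# positive does, so the loss is positive and drives the embeddings apart.
print(triplet_loss([0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.5, 0.0, 0.0]))
```

Driving this loss to zero for real/fake triplets is what yields the more separable embedding space described above.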