Face Forgery Detection with Elaborate Backbone

Face Forgery Detection (FFD), or Deepfake detection, aims to determine whether a digital face is real or fake. Because different face synthesis algorithms produce diverse forgery patterns, FFD models often overfit to the specific patterns in their training datasets and generalize poorly to unseen forgeries. Meeting this challenge requires FFD models that can represent complex facial features and extract subtle forgery cues. Previous FFD models directly employ existing backbones to represent and extract facial forgery cues, yet the critical role of the backbone is often overlooked: the knowledge and capabilities of existing backbones are insufficient for FFD, which inevitably limits generalization. It is therefore essential to take backbone pre-training configurations into account and to seek practical solutions by revisiting the complete FFD workflow, from backbone pre-training and fine-tuning to inference of the final discriminant results. Specifically, we analyze the contributions of backbones with different configurations to the FFD task and propose pre-training a ViT backbone with self-supervised learning on real-face datasets, which equips it with superior facial representation capabilities. We then build a competitive backbone fine-tuning framework that strengthens the backbone's ability to extract diverse forgery cues through a competitive learning mechanism. Moreover, we devise a threshold optimization mechanism that uses prediction confidence to improve inference reliability. Comprehensive experiments demonstrate that our FFD model with the elaborate backbone achieves excellent performance in FFD and in an additional face-related task, namely presentation attack detection. Code and models are available at https://github.com/zhenglab/FFDBackbone.
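
The abstract does not spell out how the threshold optimization mechanism works, so the following is only a generic illustration of the underlying idea: instead of fixing the real/fake cut-off at 0.5, one can choose it from the model's confidence scores on a held-out set. The function names `pick_threshold` and `predict`, the balanced-accuracy criterion, and the toy scores are all assumptions made for this sketch and are not taken from the paper or its repository.

```python
from typing import Sequence


def pick_threshold(fake_probs: Sequence[float], labels: Sequence[int]) -> float:
    """Pick the cut-off on P(fake) that maximizes balanced accuracy.

    Hypothetical helper (not the paper's mechanism).
    labels: 1 = fake, 0 = real. A sample is called fake when P(fake) >= threshold.
    """
    if len(fake_probs) != len(labels):
        raise ValueError("fake_probs and labels must have the same length")
    n_fake = sum(labels)
    n_real = len(labels) - n_fake
    if n_fake == 0 or n_real == 0:
        raise ValueError("need at least one real and one fake sample")

    best_t, best_score = 0.5, -1.0
    # Every distinct score is a candidate threshold; ties keep the lowest one.
    for t in sorted(set(fake_probs)):
        tp = sum(1 for p, y in zip(fake_probs, labels) if y == 1 and p >= t)
        tn = sum(1 for p, y in zip(fake_probs, labels) if y == 0 and p < t)
        score = 0.5 * (tp / n_fake + tn / n_real)  # balanced accuracy
        if score > best_score:
            best_t, best_score = t, score
    return best_t


def predict(fake_prob: float, threshold: float) -> str:
    """Turn a confidence score into a real/fake decision."""
    return "fake" if fake_prob >= threshold else "real"


# Toy held-out scores (made up for illustration only).
val_probs = [0.1, 0.2, 0.35, 0.4, 0.8, 0.9]
val_labels = [0, 0, 0, 1, 1, 1]
threshold = pick_threshold(val_probs, val_labels)
print(threshold, predict(0.45, threshold))
```

On these toy scores, a sample with confidence 0.45 is labeled fake under the selected threshold, whereas the default cut-off of 0.5 would label it real. Balanced accuracy is only one possible selection criterion; the authors' actual mechanism is described in the paper and the linked repository.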
