With advances in face manipulation technologies, face forgery detection has become increasingly important for protecting authentication integrity. Previous Vision Transformer (ViT)-based detectors have performed poorly in cross-database evaluations, primarily because fully fine-tuning on limited Deepfake data often leads to forgetting pre-trained knowledge and over-fitting to data-specific knowledge. To circumvent these issues, we propose a novel Forgery-aware Adaptive Vision Transformer (FA-ViT). In FA-ViT, the vanilla ViT's parameters are frozen to preserve its pre-trained knowledge, while two specially designed components, the Local-aware Forgery Injector (LFI) and the Global-aware Forgery Adaptor (GFA), are employed to adapt forgery-related knowledge. FA-ViT combines these two types of knowledge to form general forgery features for detecting Deepfakes. Specifically, LFI captures local discriminative information and incorporates it into the ViT via Neighborhood-Preserving Cross Attention (NPCA). Simultaneously, GFA learns adaptive knowledge in the self-attention layer, bridging the gap between the two different domains. Furthermore, we design a novel Single Domain Pairwise Learning (SDPL) scheme to facilitate fine-grained information learning in FA-ViT. Extensive experiments demonstrate that FA-ViT achieves state-of-the-art performance in cross-dataset and cross-manipulation evaluations and improves robustness against unseen perturbations.