Existing deepfake detection methods generalize poorly to unseen or degraded samples, which can be attributed to overfitting to low-level forgery patterns. We argue that high-level semantics are also indispensable for generalizable forgery detection. Large pre-trained Vision Transformers (ViTs) have recently shown promising generalization capability. In this paper, we propose the first parameter-efficient tuning approach for deepfake detection, namely DeepFake-Adapter, which effectively and efficiently adapts the generalizable high-level semantics of large pre-trained ViTs to aid deepfake detection. Given large pre-trained models but limited deepfake data, DeepFake-Adapter adds lightweight yet dedicated dual-level adapter modules to a ViT while keeping the model backbone frozen. Specifically, to make the adaptation process aware of both global and local forgery cues in deepfake data, we 1) insert Globally-aware Bottleneck Adapters in parallel to the MLP layers of the ViT, and 2) cross-attend Locally-aware Spatial Adapters with features from the ViT. Unlike existing deepfake detection methods that focus merely on low-level forgery patterns, the forgery detection process of our model is regularized by generalizable high-level semantics from a pre-trained ViT and adapted by the global and local low-level forgeries of deepfake data. Extensive experiments on several standard deepfake detection benchmarks validate the effectiveness of our approach. Notably, DeepFake-Adapter demonstrates a convincing advantage under cross-dataset and cross-manipulation settings. The source code is released at https://github.com/rshaojimmy/DeepFake-Adapter.
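To make the idea of a bottleneck adapter placed in parallel to a frozen MLP layer concrete, the following is a minimal pure-Python sketch. It is not the authors' implementation (see the released repository for that). The class names `FrozenMLP` and `BottleneckAdapter`, the dimensions, the ReLU non-linearity, and the zero-initialized up-projection are all illustrative assumptions, and layer normalization and the Locally-aware Spatial Adapters are omitted.

```python
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def relu(x):
    return [max(0.0, v) for v in x]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class FrozenMLP:
    """Stand-in for a pre-trained ViT MLP layer; its weights are never updated."""

    def __init__(self, dim, hidden, rng):
        self.w1 = rand_matrix(hidden, dim, rng)
        self.w2 = rand_matrix(dim, hidden, rng)

    def __call__(self, x):
        return matvec(self.w2, relu(matvec(self.w1, x)))

    def num_params(self):
        return len(self.w1) * len(self.w1[0]) + len(self.w2) * len(self.w2[0])


class BottleneckAdapter:
    """Down-project, non-linearity, up-project. Only these weights would be trained."""

    def __init__(self, dim, bottleneck, rng):
        self.down = rand_matrix(bottleneck, dim, rng)
        # Assumption: zero-initialized up-projection, so the adapter starts as a no-op.
        self.up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, x):
        return matvec(self.up, relu(matvec(self.down, x)))

    def num_params(self):
        return len(self.down) * len(self.down[0]) + len(self.up) * len(self.up[0])


def adapted_block(x, mlp, adapter):
    """Residual block with the adapter in parallel to the MLP: x + MLP(x) + Adapter(x)."""
    m, a = mlp(x), adapter(x)
    return [xi + mi + ai for xi, mi, ai in zip(x, m, a)]


# Hypothetical sizes chosen only for illustration.
rng = random.Random(0)
dim, hidden, bottleneck = 8, 32, 2
mlp = FrozenMLP(dim, hidden, rng)
adapter = BottleneckAdapter(dim, bottleneck, rng)

x = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
baseline = [xi + mi for xi, mi in zip(x, mlp(x))]  # block output without the adapter
adapted = adapted_block(x, mlp, adapter)

print("frozen parameters:   ", mlp.num_params())
print("trainable parameters:", adapter.num_params())
print("adapter is a no-op at initialization:", adapted == baseline)
```

In this sketch the adapter holds 32 weights against 512 in the frozen MLP, which illustrates why tuning only the adapters is parameter-efficient. With the zero-initialized up-projection, the adapted block initially reproduces the frozen block's output, so training starts from the pre-trained behavior.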