DeepfakeMAE: Facial Part Consistency Aware Masked Autoencoder for Deepfake Video Detection

Deepfake techniques have been used maliciously, which has driven strong research interest in Deepfake detection methods. Deepfakes often manipulate video content by tampering with some facial parts. However, such manipulation usually breaks the consistency among facial parts; for example, a Deepfake may change smiling lips to an upset expression while the eyes are still smiling. Existing works propose to spot inconsistency on specific facial parts (e.g., lips), but they may perform poorly if new Deepfake techniques focus on the specific facial parts used by the detector. Thus, this paper proposes a new Deepfake detection model, DeepfakeMAE, which exploits the consistency among all facial parts. Specifically, given real face images, we first pretrain a masked autoencoder to learn facial part consistency by randomly masking some facial parts and reconstructing the missing areas from the remaining facial parts. Furthermore, to maximize the discrepancy between real and fake videos, we propose a novel model with dual networks that utilize the pretrained encoder and decoder, respectively: 1) the pretrained encoder is finetuned to capture the overall information of the given video; 2) the pretrained decoder is used to distinguish real from fake videos, motivated by the expectation that DeepfakeMAE's reconstruction should be more similar to a real face image than to a fake one. Extensive experiments on standard benchmarks demonstrate that DeepfakeMAE is highly effective and, in particular, outperforms the previous state-of-the-art method by 3.1% AUC on average in cross-dataset detection.
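
The facial-part masking and the reconstruction-based comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the 14x14 patch grid, the `PART_BOXES` regions, and the helper names `part_mask`, `sample_part_mask`, and `masked_mse` are all hypothetical, and each patch is reduced to a single number to keep the example short.

```python
import random

# Hypothetical facial-part regions on a 14x14 patch grid, given as
# (row_start, row_end, col_start, col_end) with exclusive ends.
# These boxes are illustrative assumptions, not values from the paper.
PART_BOXES = {
    "eyes": (4, 6, 2, 12),
    "nose": (6, 9, 5, 9),
    "mouth": (9, 12, 4, 10),
}
GRID = 14


def part_mask(parts, grid=GRID):
    """Return a grid x grid boolean mask, True where a patch is hidden."""
    mask = [[False] * grid for _ in range(grid)]
    for name in parts:
        r0, r1, c0, c1 = PART_BOXES[name]
        for r in range(r0, r1):
            for c in range(c0, c1):
                mask[r][c] = True
    return mask


def sample_part_mask(num_parts=1, rng=random):
    """Pick facial parts at random and mask every patch inside them."""
    chosen = rng.sample(sorted(PART_BOXES), num_parts)
    return chosen, part_mask(chosen)


def masked_mse(original, reconstruction, mask):
    """Mean squared error over masked patches only.

    Each patch is a single float here; a real model would compare pixels
    or features per patch.
    """
    errors = [
        (original[r][c] - reconstruction[r][c]) ** 2
        for r in range(len(mask))
        for c in range(len(mask[r]))
        if mask[r][c]
    ]
    return sum(errors) / len(errors)
```

Under these assumptions, pretraining would call `sample_part_mask` on real faces and train the autoencoder to fill in the hidden patches. At detection time, a larger `masked_mse` between an input face and its reconstruction would hint that the input is less consistent with what the model learned from real faces. How the paper actually turns this signal into a decision is not specified in the abstract.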
