Mover: Mask and Recovery based Facial Part Consistency Aware Method for Deepfake Video Detection

Deepfake techniques have been widely used for malicious purposes, prompting extensive research into Deepfake detection methods. Deepfake manipulations typically tamper with individual facial parts, which can introduce inconsistencies across different parts of the face. For instance, a Deepfake technique may change smiling lips to upset lips while the eyes remain smiling. Existing detection methods depend on specific indicators of forgery, which tend to disappear as forgery techniques improve. To address this limitation, we propose Mover, a new Deepfake detection model that exploits unspecific facial part inconsistencies, which are inevitable weaknesses of Deepfake videos. Mover randomly masks regions of interest (ROIs) and recovers the faces in order to learn unspecific features, so that fake faces are difficult to recover while real faces are recovered easily. Specifically, given a real face image, we first pretrain a masked autoencoder to learn facial part consistency: the face is divided into three parts, ROIs are randomly masked, and the masked ROIs are then recovered from the unmasked facial parts. Furthermore, to maximize the discrepancy between real and fake videos, we propose a novel model with dual networks that use the pretrained encoder and the pretrained masked autoencoder, respectively. 1) The pretrained encoder is finetuned to capture the encoding of inconsistent information in the given video. 2) The pretrained masked autoencoder is used to map faces and distinguish real from fake videos. Our extensive experiments on standard benchmarks demonstrate that Mover is highly effective.
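The masking step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which three facial parts are used or how ROIs are sampled, so the horizontal bands in `PART_ROWS`, the patch size, the masking ratio, and the helper names `mask_part` and `recovery_error` are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical split of a face crop into three horizontal bands, given as
# fractions of the image height. The paper's actual parts may differ.
PART_ROWS = {"eyes": (0.0, 0.4), "nose": (0.4, 0.65), "mouth": (0.65, 1.0)}


def mask_part(face, part, patch=16, ratio=0.75, rng=None):
    """Zero out a random subset of patches inside one facial band.

    Returns the masked copy and a boolean pixel mask (True = hidden).
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = face.shape[:2]
    # Snap the band to the patch grid so whole patches are masked.
    top = int(PART_ROWS[part][0] * h) // patch * patch
    bottom = min(int(np.ceil(PART_ROWS[part][1] * h / patch)) * patch, h)
    # Top-left corner of every patch that lies in the band.
    corners = [(r, c) for r in range(top, bottom, patch)
               for c in range(0, w, patch)]
    n_masked = int(round(ratio * len(corners)))
    chosen = rng.choice(len(corners), size=n_masked, replace=False)
    mask = np.zeros((h, w), dtype=bool)
    for i in chosen:
        r, c = corners[i]
        mask[r:r + patch, c:c + patch] = True
    masked = face.copy()
    masked[mask] = 0  # hide the selected pixels in every channel
    return masked, mask


def recovery_error(face, recovered, mask):
    """Mean squared error restricted to the hidden pixels."""
    diff = (face.astype(np.float64) - recovered.astype(np.float64)) ** 2
    return float(diff[mask].mean())
```

In a pipeline of this kind, `recovered` would come from the pretrained masked autoencoder, and the expectation stated in the abstract is that the error on hidden pixels is larger for fake faces than for real ones. The sketch only covers masking and scoring; it contains no model.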
