We present a novel approach to deepfake video detection that uses a pair of vision transformers pre-trained with self-supervised masked autoencoding. Our method consists of two distinct components: one learns spatial information from individual RGB frames of the video, while the other learns temporal consistency information from optical flow fields computed from consecutive frames. Unlike most approaches, which pre-train on a large generic image corpus, we show that pre-training on smaller face-related datasets, namely Celeb-A (for the spatial learning component) and YouTube Faces (for the temporal learning component), yields strong results. We evaluate our method on the commonly used FaceForensics++ (Low Quality and High Quality, along with a new highly compressed version named Very Low Quality) and Celeb-DFv2 datasets. Our experiments show that our method sets a new state of the art on FaceForensics++ (LQ, HQ, and VLQ) and obtains competitive results on Celeb-DFv2. Moreover, our method outperforms other methods in a cross-dataset setup in which we fine-tune our model on FaceForensics++ and test on Celeb-DFv2, pointing to its strong cross-dataset generalization ability.
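As an illustration of the masked autoencoding pre-training mentioned above, the following minimal sketch shows only the random patch-masking step: the input is treated as a grid of patches, a random subset is hidden, and the encoder sees the rest. The function name `random_patch_mask`, the 75% mask ratio, and the 224x224 input with 16x16 patches are assumptions chosen for illustration and are not taken from the abstract. The same masking could in principle be applied to either an RGB frame or an optical flow field.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into (visible, masked) lists.

    Hypothetical helper: mask_ratio=0.75 is an assumed default,
    not a value stated in the abstract.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)  # random permutation of patch positions
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(indices[:num_masked])   # hidden from the encoder
    visible = sorted(indices[num_masked:])  # fed to the encoder
    return visible, masked


# Assumed sizes: a 224x224 input cut into 16x16 patches gives 14 * 14 = 196 patches.
num_patches = (224 // 16) ** 2
visible, masked = random_patch_mask(num_patches, mask_ratio=0.75, seed=0)
print(len(visible), len(masked))  # 49 147
```

In a full pipeline, the encoder would process only the visible patches and a decoder would be trained to reconstruct the masked ones. That part is omitted here because the abstract does not specify the architecture.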