We present a novel approach to deepfake video detection that uses a pair of vision transformers pre-trained with a self-supervised masked autoencoding objective. Our method consists of two components: one learns spatial information from individual RGB frames of the video, while the other learns temporal consistency from optical flow fields computed between consecutive frames. Unlike most approaches, where pre-training is performed on a large generic image corpus, we show that pre-training on smaller face-related datasets, namely Celeb-A (for the spatial component) and YouTube Faces (for the temporal component), yields strong results. We evaluate our method on the commonly used FaceForensics++ (Low Quality and High Quality, along with a new, highly compressed version named Very Low Quality) and Celeb-DFv2 datasets. Our experiments show that our method sets a new state of the art on FaceForensics++ (LQ, HQ, and VLQ) and obtains competitive results on Celeb-DFv2. Moreover, our method outperforms prior methods in a cross-dataset setup where we fine-tune on FaceForensics++ and test on Celeb-DFv2, pointing to its strong cross-dataset generalization ability.
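To make the two-stream design concrete, the following is a minimal PyTorch sketch, not the authors' code: one ViT-style encoder over RGB frames and another over two-channel optical-flow fields, with their pooled features fused by concatenation into a real/fake head. All module names, dimensions, and the concatenation-based fusion are assumptions for illustration; in the paper each encoder would be initialized from MAE-style pre-training rather than trained from scratch.

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and project them to tokens."""
    def __init__(self, patch=16, in_ch=3, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.proj(x)                       # (B, dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)    # (B, N, dim)

class ViTEncoder(nn.Module):
    """Plain ViT encoder; assumed stand-in for the MAE pre-trained backbones."""
    def __init__(self, in_ch, dim=768, depth=12, heads=12):
        super().__init__()
        self.embed = PatchEmbed(in_ch=in_ch, dim=dim)
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, x):
        tokens = self.blocks(self.embed(x))
        return tokens.mean(dim=1)              # global average-pooled feature

class TwoStreamDeepfakeDetector(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.spatial = ViTEncoder(in_ch=3, dim=dim)   # RGB frame stream
        self.temporal = ViTEncoder(in_ch=2, dim=dim)  # optical-flow stream (u, v)
        self.head = nn.Linear(2 * dim, 2)             # real vs. fake logits

    def forward(self, rgb, flow):
        feat = torch.cat([self.spatial(rgb), self.temporal(flow)], dim=-1)
        return self.head(feat)

model = TwoStreamDeepfakeDetector()
logits = model(torch.randn(1, 3, 224, 224), torch.randn(1, 2, 224, 224))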