Deepfakes have recently drawn considerable public attention due to security and privacy concerns in social media digital forensics. As the Deepfake videos spreading widely on the Internet become more realistic, traditional detection techniques fail to distinguish real videos from fake ones. Most existing deep learning methods use convolutional neural networks as a backbone and focus mainly on local features and relations within the face image. However, local features and relations alone do not let a model learn enough general information for Deepfake detection, so existing detection methods have hit a performance bottleneck. To address this issue, we propose a deep convolutional Transformer that incorporates decisive image features both locally and globally. Specifically, we apply convolutional pooling and re-attention to enrich the extracted features and enhance efficacy. Moreover, we employ image keyframes, which have rarely been discussed, in model training to improve performance, and we visualize the feature quantity gap between key and normal image frames caused by video compression. Finally, we demonstrate the transferability of the approach with extensive experiments on several Deepfake benchmark datasets. The proposed solution consistently outperforms several state-of-the-art baselines in both within-dataset and cross-dataset experiments.