Deepfakes have recently drawn considerable public attention because of the security and privacy concerns they raise in social media digital forensics. As the Deepfake videos circulating widely on the Internet become more realistic, traditional detection techniques fail to distinguish real from fake. Most existing deep learning methods focus on local features and relations within the face image, using convolutional neural networks as the backbone. However, local features and relations alone do not give a model enough general information for Deepfake detection, so existing detection methods have reached a performance bottleneck. To address this issue, we propose a deep convolutional Transformer that incorporates decisive image features both locally and globally. Specifically, we apply convolutional pooling and re-attention to enrich the extracted features and enhance efficacy. Moreover, we employ image keyframes, which are rarely discussed in this setting, in model training to improve performance, and we visualize the gap in feature quantity between key and normal image frames caused by video compression. Finally, we demonstrate transferability through extensive experiments on several Deepfake benchmark datasets. The proposed solution consistently outperforms several state-of-the-art baselines in both within-dataset and cross-dataset experiments.
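The abstract names re-attention without defining it. As a rough illustration only, the sketch below follows one common formulation, in which the per-head attention maps are mixed across heads by a learnable matrix before being applied to the values. Whether the proposed model uses exactly this form is an assumption; the function name `re_attention`, the matrix `theta`, and the input layout are hypothetical, and the normalization step that usually follows the mixing is omitted for brevity.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def re_attention(scores, theta, values):
    """Hypothetical re-attention sketch (not the paper's confirmed method).

    scores: H x N x N raw attention logits (H heads, N tokens)
    theta:  H x H head-mixing matrix (learned in practice, fixed here)
    values: H x N x D value vectors per head
    Returns an H x N x D list of outputs.
    """
    heads = len(scores)
    tokens = len(scores[0])
    dim = len(values[0][0])

    # Standard attention maps: softmax over each query row, per head.
    attn = [[softmax(row) for row in head] for head in scores]

    # Re-attention step: every output head is a weighted mix of all heads' maps.
    mixed = [
        [[sum(theta[h][g] * attn[g][i][j] for g in range(heads))
          for j in range(tokens)]
         for i in range(tokens)]
        for h in range(heads)
    ]

    # Apply the mixed maps to the values of each head.
    return [
        [[sum(mixed[h][i][j] * values[h][j][d] for j in range(tokens))
          for d in range(dim)]
         for i in range(tokens)]
        for h in range(heads)
    ]
```

With `theta` set to the identity matrix, this reduces to ordinary multi-head attention; off-diagonal entries let one head borrow the attention pattern of another.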