Deepfakes have recently drawn considerable public attention because of security and privacy concerns in social-media digital forensics. As the Deepfake videos circulating widely on the Internet become more realistic, traditional detection techniques fail to distinguish real videos from fake ones. Most existing deep learning methods use convolutional neural networks as a backbone and focus on local features and relations within the face image. However, local features and relations alone do not give a model enough general information for Deepfake detection, so existing methods have hit a bottleneck in detection performance. To address this issue, we propose a deep convolutional Transformer that incorporates the decisive image features both locally and globally. Specifically, we apply convolutional pooling and re-attention to enrich the extracted features and enhance efficacy. Moreover, we use image keyframes, which are rarely discussed in prior work, in model training to improve performance, and we visualize the feature quantity gap between key and normal image frames caused by video compression. Finally, we demonstrate transferability through extensive experiments on several Deepfake benchmark datasets. The proposed solution consistently outperforms several state-of-the-art baselines in both within-dataset and cross-dataset experiments.
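The abstract names re-attention without defining it. As a rough illustration only, the sketch below assumes the general idea from earlier vision Transformer work: the attention maps of the individual heads are mixed by a learnable heads-by-heads matrix before being applied to the values. The function names `attention_maps` and `re_attention`, the matrix `theta`, and the use of a simple row renormalization in place of a learned normalization layer are all assumptions made for this example, not the authors' implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_maps(q, k):
    """Scaled dot-product attention maps, one per head.

    q, k: nested lists shaped [heads][tokens][dim].
    Returns a nested list shaped [heads][tokens][tokens].
    """
    maps = []
    for q_head, k_head in zip(q, k):
        scale = math.sqrt(len(q_head[0]))
        maps.append([
            softmax([sum(a * b for a, b in zip(q_i, k_j)) / scale
                     for k_j in k_head])
            for q_i in q_head
        ])
    return maps


def re_attention(maps, theta):
    """Mix attention maps across heads, then renormalize each row.

    theta is a hypothetical heads x heads mixing matrix; output head g is
    sum_h theta[h][g] * maps[h]. Assumes non-negative entries with positive
    column sums so that row renormalization is well defined (an assumption
    of this sketch, standing in for a learned normalization layer).
    """
    heads = len(maps)
    tokens = len(maps[0])
    mixed = []
    for g in range(heads):
        head = []
        for i in range(tokens):
            row = [sum(theta[h][g] * maps[h][i][j] for h in range(heads))
                   for j in range(tokens)]
            total = sum(row)
            head.append([x / total for x in row])
        mixed.append(head)
    return mixed
```

With `theta` set to the identity matrix the function returns the input maps unchanged, which corresponds to ordinary multi-head attention; off-diagonal entries let each head borrow attention patterns from the others.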