With the rising prevalence of deepfakes, there is growing interest in detection methods that generalize across different types of deepfakes. Traditional detection methods, while effective within their specific modalities, generalize poorly across diverse cross-modal deepfakes. This paper aims to explicitly learn potential cross-modal correlation to improve deepfake detection across varied generation scenarios. Our approach introduces a correlation distillation task, which models the inherent cross-modal correlation based on content information. This strategy helps prevent the model from overfitting merely to audio-visual synchronization. Additionally, we present the Cross-Modal Deepfake Dataset (CMDFD), a comprehensive dataset covering four generation methods, to evaluate the detection of diverse cross-modal deepfakes. Experimental results on the CMDFD and FakeAVCeleb datasets demonstrate that our method generalizes better than existing state-of-the-art methods. Our code and data are available at \url{https://github.com/ljj898/CMDFD-Dataset-and-Deepfake-Detection}.