This paper addresses the challenge of developing a robust audio-visual deepfake detection model. In practical use cases, new generation algorithms continually emerge, and these algorithms are not encountered during the development of detection methods, which calls for strong generalization ability in the detector. Additionally, to ensure the credibility of detection methods, it is beneficial for the model to indicate which cues in the video reveal that it is fake. Motivated by these considerations, we propose a multi-stream fusion approach with one-class learning as a representation-level regularization technique. We study the generalization problem of audio-visual deepfake detection by creating a new benchmark through extending and re-splitting the existing FakeAVCeleb dataset. The benchmark contains four categories of fake videos (Real Audio-Fake Visual, Fake Audio-Fake Visual, Fake Audio-Real Visual, and unsynchronized video). The experimental results show that our approach improves the model's detection of unseen attacks by an average of 7.31% across the four test sets, compared to the baseline model. Additionally, our proposed framework offers interpretability, indicating which modality the model identifies as fake.
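The abstract names one-class learning as the representation-level regularization but does not specify the exact objective. A common choice in the anti-spoofing and deepfake-detection literature is a one-class softmax (OC-Softmax) style loss, which compacts real-class embeddings around a learned direction while pushing fake embeddings away from it. The following is a minimal PyTorch sketch of such a loss, not the paper's confirmed implementation; the feature dimension, the margins `m_real`/`m_fake`, and the scale `alpha` are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OCSoftmax(nn.Module):
    """One-class softmax loss: encourages real embeddings to align with a
    learned center direction (cosine score above m_real) and fake embeddings
    to fall away from it (cosine score below m_fake)."""

    def __init__(self, feat_dim=256, m_real=0.9, m_fake=0.2, alpha=20.0):
        super().__init__()
        # Learned center direction for the real (bona fide) class.
        self.center = nn.Parameter(torch.randn(1, feat_dim))
        self.m_real = m_real   # margin for real samples (assumed value)
        self.m_fake = m_fake   # margin for fake samples (assumed value)
        self.alpha = alpha     # scale factor (assumed value)

    def forward(self, feats, labels):
        # feats: (B, feat_dim) embeddings; labels: (B,) with 0 = real, 1 = fake
        w = F.normalize(self.center, dim=1)      # (1, d) unit center
        x = F.normalize(feats, dim=1)            # (B, d) unit embeddings
        scores = (x @ w.t()).squeeze(1)          # (B,) cosine similarities
        # Real samples are penalized when score < m_real,
        # fake samples when score > m_fake.
        margin = torch.where(labels == 0,
                             self.m_real - scores,
                             scores - self.m_fake)
        loss = F.softplus(self.alpha * margin).mean()
        return loss, scores  # scores can double as detection scores

# Example usage with random embeddings:
# feats = torch.randn(8, 256); labels = torch.randint(0, 2, (8,))
# criterion = OCSoftmax(feat_dim=256)
# loss, scores = criterion(feats, labels)
```

In a multi-stream setup, one instance of such a loss could be applied to each stream's embedding, and the resulting per-stream cosine scores would then indicate which modality the model flags as fake, consistent with the interpretability claim above.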