AVTENet: Audio-Visual Transformer-based Ensemble Network Exploiting Multiple Experts for Video Deepfake Detection

Forged content shared widely on social media platforms is a major social problem that calls for increased regulation and poses new challenges to the research community. The recent proliferation of hyper-realistic deepfake videos has drawn attention to the threat of audio and visual forgeries. Most previous work on detecting AI-generated fake videos uses only the visual modality or only the audio modality. Although some methods in the literature exploit both audio and visual modalities to detect forged videos, they have not been comprehensively evaluated on multi-modal deepfake video datasets that involve both acoustic and visual manipulations. Moreover, these existing methods are mostly based on CNNs and suffer from low detection accuracy. Inspired by the recent success of Transformers in various fields, we propose an Audio-Visual Transformer-based Ensemble Network (AVTENet) framework that considers both acoustic and visual manipulation to achieve effective video forgery detection. Specifically, the proposed model integrates several purely transformer-based variants that capture salient video, audio, and audio-visual cues and combines their outputs to reach a consensus prediction. For evaluation, we use the recently released multi-modal audio-video benchmark dataset FakeAVCeleb. For a detailed analysis, we evaluate AVTENet, its variants, and several existing methods on multiple test sets of the FakeAVCeleb dataset. Experimental results show that our best model outperforms all of the existing methods evaluated and achieves state-of-the-art performance on Testset-I and Testset-II of the FakeAVCeleb dataset.
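The idea of several experts reaching a consensus can be illustrated with a minimal sketch. The abstract does not say how AVTENet fuses its experts, so the two rules below (score averaging and majority voting), the function names, the 0.5 threshold, and the example scores are all illustrative assumptions rather than the authors' method.

```python
from statistics import mean

# Hypothetical per-expert "fake" probabilities for one video.
# The keys mirror the three cue types named in the abstract; the
# values are made up for illustration only.
expert_scores = {"video": 0.25, "audio": 0.75, "audio_visual": 0.875}


def average_fusion(scores):
    """Score-level fusion: average the experts' fake probabilities."""
    return mean(list(scores.values()))


def majority_vote(scores, threshold=0.5):
    """Decision-level fusion: flag as fake if most experts say fake."""
    votes = sum(1 for s in scores.values() if s >= threshold)
    return votes * 2 > len(scores)


fused = average_fusion(expert_scores)
print("average score:", fused, "-> fake" if fused >= 0.5 else "-> real")
print("majority vote says fake:", majority_vote(expert_scores))
```

In this made-up case the video-only expert misses the forgery, yet both fusion rules still flag the clip, which is the kind of robustness an ensemble over audio, video, and audio-visual experts is meant to provide.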
