With manipulated media on the rise, deepfake detection has become essential for preserving the authenticity of digital content. In this paper, we present a novel multi-modal audio-video framework that processes audio and video inputs concurrently for deepfake detection. Our model exploits lip synchronization with the input audio through a cross-attention mechanism while extracting visual cues via a fine-tuned VGG-16 network. A transformer encoder network then performs facial self-attention. We conduct multiple ablation studies that highlight the different strengths of our approach. Our multi-modal method outperforms state-of-the-art multi-modal deepfake detection techniques in terms of F1 and per-video AUC scores.
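To make the lip-audio cross-attention idea concrete, the following is a minimal NumPy sketch of single-head scaled dot-product cross-attention, not the implementation used in the paper. It assumes that per-frame lip features act as queries and audio features act as keys and values; the abstract does not state which modality plays which role. All names, dimensions, and the random projection matrices (stand-ins for learned weights) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(lip_feats, audio_feats, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.

    Assumption: lip features are the queries, audio features supply
    keys and values. The paper may use a different arrangement.
    """
    q = lip_feats @ w_q        # (T_v, d_k) queries from video frames
    k = audio_feats @ w_k      # (T_a, d_k) keys from audio frames
    v = audio_feats @ w_v      # (T_a, d_k) values from audio frames
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T_v, T_a) similarity
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights


# Hypothetical sizes: 8 video frames, 20 audio frames.
rng = np.random.default_rng(0)
T_v, T_a, d_lip, d_audio, d_k = 8, 20, 32, 16, 24
lip = rng.standard_normal((T_v, d_lip))
audio = rng.standard_normal((T_a, d_audio))

# Random matrices stand in for learned projection weights.
w_q = rng.standard_normal((d_lip, d_k)) / np.sqrt(d_lip)
w_k = rng.standard_normal((d_audio, d_k)) / np.sqrt(d_audio)
w_v = rng.standard_normal((d_audio, d_k)) / np.sqrt(d_audio)

fused, weights = cross_attention(lip, audio, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

Each row of `weights` describes how strongly one video frame attends to every audio frame. A trained model could in principle use mismatches in this alignment as a cue for manipulation, although the abstract does not specify how the fused features are consumed downstream.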