The rapid advancement of generative AI models is producing increasingly realistic deepfake media in which audio, video, or both are manipulated, raising severe privacy and societal concerns. Numerous studies in this area report promising intra-domain results; however, these models frequently lose efficacy when faced with data from dissimilar domains. Consequently, recent deepfake detection approaches focus on improving generalization through techniques that incorporate all input modalities, including audio, images, and their interactions. In this regard, we propose EAV-DFD, a generalized deep ensemble audio-visual model combined with a teacher-student domain adaptation mechanism that improves the model's ability to generalize to unseen domains. To evaluate the model, we used the FakeAVCeleb dataset as the primary domain and the DFDC, Deepfake_TIMIT, and PolyGlotFake datasets as unseen domains. Our experimental results demonstrate that the proposed framework is effective at domain adaptation, improving the model's AUC by 4.09%, 17.94%, and 0.5% on the three unseen datasets, respectively, while using only a small portion of each to train the student model. The result is a novel deepfake detection model capable of adapting to new domains and identifying which modality has been manipulated, highlighting the potential of our approach for real-world applications.