Deep learning has been successfully applied in diverse fields, and deepfake detection is no exception. Deepfakes are fake yet realistic synthetic content that can be used deceitfully for political impersonation, phishing, slander, or spreading misinformation. Despite extensive research on unimodal deepfake detection, identifying complex deepfakes through joint analysis of audio and visual streams remains relatively unexplored. To fill this gap, this survey first gives an overview of audiovisual deepfake generation techniques, their applications, and their consequences. It then comprehensively reviews state-of-the-art methods that combine audio and visual modalities to enhance detection accuracy, summarizing and critically analyzing their strengths and limitations. Furthermore, we discuss existing open-source datasets, which can serve the research community and give beginners the background they need to analyze deep learning-based audiovisual methods for video forensics. By bridging the gap between unimodal and multimodal approaches, this paper aims to improve the effectiveness of deepfake detection strategies and guide future research in cybersecurity and media integrity.