A Unified Framework for Modality-Agnostic Deepfakes Detection

From arXiv. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

As AI-generated content (AIGC) thrives, deepfakes have expanded from single-modality falsification to cross-modal fake content creation, where either the audio or the visual component can be manipulated. Two unimodal detectors can detect audio-visual deepfakes, but they may overlook cross-modal forgery clues. Existing multimodal deepfake detection methods typically establish correspondence between the audio and visual modalities for binary real/fake classification, and they require both modalities to be present. However, in real-world multimodal applications, one modality may be unavailable, and in such cases audio-visual detection methods are less practical than two independent unimodal methods. Moreover, a detector cannot always know in advance how many or which modalities have been manipulated, which calls for a fake-modality-agnostic audio-visual detector. In this work, we introduce a unified framework that is agnostic to fake modalities: it identifies multimodal deepfakes and handles missing modalities, regardless of whether the manipulation lies in the audio, the video, or a cross-modal form. To enhance the modeling of cross-modal forgery clues, we employ audio-visual speech recognition (AVSR) as a preliminary task. This efficiently extracts speech correlations across modalities, which are difficult for deepfakes to replicate. Additionally, we propose a dual-label detection approach that follows the structure of AVSR to support the independent detection of each modality. Extensive experiments on three audio-visual datasets show that our scheme outperforms state-of-the-art detection methods, with promising performance on modality-agnostic audio/video deepfakes.
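
The dual-label idea with a possibly missing modality can be pictured with a small sketch. The abstract does not specify the loss, the network, or any threshold, so everything below is an assumption made for illustration and not the authors' implementation: the function names are hypothetical, each modality gets its own binary cross-entropy term, `None` marks a missing modality, and 0.5 is an arbitrary decision threshold.

```python
import math
from typing import Dict, Optional


def bce(prob_fake: float, label: int, eps: float = 1e-7) -> float:
    """Binary cross-entropy for one prediction (label 1 = fake, 0 = real)."""
    p = min(max(prob_fake, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def dual_label_loss(
    audio_prob: Optional[float],
    video_prob: Optional[float],
    audio_label: Optional[int],
    video_label: Optional[int],
) -> float:
    """Average the per-modality losses, skipping a missing modality (None).

    Assumption: one independent real/fake label per modality.
    """
    losses = []
    if audio_prob is not None and audio_label is not None:
        losses.append(bce(audio_prob, audio_label))
    if video_prob is not None and video_label is not None:
        losses.append(bce(video_prob, video_label))
    if not losses:
        raise ValueError("at least one modality must be present")
    return sum(losses) / len(losses)


def verdict(
    audio_prob: Optional[float],
    video_prob: Optional[float],
    threshold: float = 0.5,  # hypothetical threshold, not from the paper
) -> Dict[str, bool]:
    """Report each available modality separately, plus an overall flag."""
    out: Dict[str, bool] = {}
    if audio_prob is not None:
        out["audio_fake"] = audio_prob >= threshold
    if video_prob is not None:
        out["video_fake"] = video_prob >= threshold
    out["any_fake"] = any(out.values())
    return out


if __name__ == "__main__":
    # Both modalities present: audio looks fake, video looks real.
    print(verdict(0.9, 0.1))
    # Audio missing: only the video decision is reported.
    print(verdict(None, 0.2))
```

Keeping one output per modality is what lets such a detector say which stream was manipulated and still return a decision when only one stream is available. A single real/fake label would not support either.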
