Existing methods for audio-visual deepfake detection mainly focus on high-level features for modeling inconsistencies between audio and visual data. As a result, these approaches usually overlook finer audio-visual artifacts, which are inherent to deepfakes. Herein, we propose fine-grained mechanisms for detecting subtle artifacts in both the spatial and temporal domains. First, we introduce a local audio-visual model capable of capturing small spatial regions that are prone to inconsistencies with the audio. To this end, we adopt a fine-grained mechanism that couples a spatially-local distance with an attention module. Second, we introduce a temporally-local pseudo-fake augmentation that adds samples with subtle temporal inconsistencies to the training set. Experiments on the DFDC and FakeAVCeleb datasets demonstrate that the proposed method generalizes better than the state of the art under both in-dataset and cross-dataset settings.
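The following is a minimal sketch of what a spatially-local audio-visual distance with attention pooling could look like. It is an illustrative assumption based on the description above, not the paper's actual architecture: feature shapes, the attention design, and all names (`LocalAudioVisualDistance`, `attn`) are hypothetical.

```python
import torch
import torch.nn as nn

class LocalAudioVisualDistance(nn.Module):
    """Sketch: per-region audio-visual distance aggregated with attention.

    visual: (B, R, D) descriptors for R local spatial regions.
    audio:  (B, D) clip-level audio embedding broadcast to every region.
    Shapes and module names are illustrative assumptions, not the paper's API.
    """
    def __init__(self, dim: int):
        super().__init__()
        # Scores one attention weight per region from its visual descriptor.
        self.attn = nn.Sequential(nn.Linear(dim, dim // 2), nn.Tanh(), nn.Linear(dim // 2, 1))

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # Per-region L2 distance to the broadcast audio embedding.
        local_dist = torch.norm(visual - audio.unsqueeze(1), dim=-1)    # (B, R)
        # Attention over regions, so regions prone to mismatch dominate the score.
        weights = torch.softmax(self.attn(visual).squeeze(-1), dim=-1)  # (B, R)
        # Weighted aggregation yields a clip-level inconsistency score.
        return (weights * local_dist).sum(dim=-1)                       # (B,)

# Usage: a higher score indicates stronger local audio-visual inconsistency.
model = LocalAudioVisualDistance(dim=256)
score = model(torch.randn(4, 49, 256), torch.randn(4, 256))
```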
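Similarly, the temporally-local pseudo-fake augmentation can be sketched as perturbing a short window of one stream so that audio and video fall subtly out of sync. The window size, shift range, and function name below are assumptions for illustration, not the paper's exact recipe.

```python
import random
import torch

def temporally_local_pseudo_fake(video: torch.Tensor, audio: torch.Tensor,
                                 window: int = 8, max_shift: int = 3):
    """Sketch: create a pseudo-fake by desynchronizing a short audio window.

    video: (T, C, H, W) frames; audio: (T, D) frame-aligned audio features.
    Only a small temporal window of the audio is shifted, yielding a training
    sample with a subtle, localized temporal inconsistency (labeled fake).
    """
    T = audio.shape[0]
    start = random.randint(0, T - window)            # pick a short local window
    shift = random.choice([-max_shift, max_shift])   # small forward/backward offset
    src = torch.arange(start, start + window)
    dst = (src + shift).clamp(0, T - 1)              # shifted (desynchronized) indices
    audio_fake = audio.clone()
    audio_fake[src] = audio[dst]                     # misalign audio only inside the window
    return video, audio_fake
```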