Most deepfake detection methods focus on detecting spatial and/or spatio-temporal changes in facial attributes, because the available benchmark datasets contain mostly visual-only modifications. However, a sophisticated deepfake may include short segments of audio or audio-visual manipulation that completely change the meaning of the content. To address this gap, we propose and benchmark a new dataset, Localized Audio Visual DeepFake (LAV-DF), consisting of strategic, content-driven audio, visual, and audio-visual manipulations. The proposed baseline method, Boundary Aware Temporal Forgery Detection (BA-TFD), is a 3D Convolutional Neural Network-based architecture that efficiently captures multimodal manipulations. We further improve the baseline method (yielding BA-TFD+) by replacing the backbone with a Multiscale Vision Transformer and guiding the training process with contrastive, frame classification, boundary matching, and multimodal boundary matching loss functions. The quantitative analysis demonstrates the superiority of BA-TFD+ on temporal forgery localization and deepfake detection tasks using several benchmark datasets, including our newly proposed dataset. The dataset, models, and code are available at https://github.com/ControlNet/LAV-DF.