Most deepfake detection methods focus on detecting spatial and/or spatio-temporal changes in facial attributes, because the available benchmark datasets contain mostly visual-only modifications. However, a sophisticated deepfake may include small segments of audio or audio-visual manipulation that completely change the meaning of the content. To address this gap, we propose and benchmark a new dataset, Localized Audio Visual DeepFake (LAV-DF), consisting of strategic content-driven audio, visual and audio-visual manipulations. The proposed baseline method, Boundary Aware Temporal Forgery Detection (BA-TFD), is a 3D Convolutional Neural Network-based architecture that efficiently captures multimodal manipulations. We further improve the baseline method, yielding BA-TFD+, by replacing the backbone with a Multiscale Vision Transformer and guiding the training process with contrastive, frame classification, boundary matching and multimodal boundary matching loss functions. Quantitative analysis demonstrates the superiority of BA-TFD+ on temporal forgery localization and deepfake detection tasks across several benchmark datasets, including our newly proposed dataset. The dataset, models and code are available at https://github.com/ControlNet/LAV-DF.