Detecting deepfake videos is highly challenging because forged sequences contain complex, intertwined spatial and temporal artifacts. Most recent approaches rely on binary classifiers trained on both real and fake data. However, such methods may struggle to focus on important artifacts, which can hinder their generalization capability. Additionally, these models often lack interpretability, making it difficult to understand how predictions are made. To address these issues, we propose FakeSTormer, which makes two key contributions. First, we introduce a multi-task learning framework with additional spatial and temporal branches that enable the model to focus on subtle spatio-temporal artifacts. These branches also provide interpretability by highlighting video regions that may contain artifacts. Second, we propose a video-level data synthesis algorithm that generates pseudo-fake videos with subtle artifacts, providing the model with high-quality samples and ground-truth data for our spatial and temporal branches. Extensive experiments on several challenging benchmarks demonstrate the competitiveness of our approach compared to recent state-of-the-art methods. The code is available at https://github.com/10Ring/FakeSTormer.
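As an illustration of the multi-task idea described above, the following minimal sketch combines a classification loss with two auxiliary losses, one for a spatial branch and one for a temporal branch. The abstract does not specify the loss functions or their weighting, so everything here is an assumption made for illustration: the use of binary cross-entropy and mean squared error, the weights `w_spatial` and `w_temporal`, and all function names are hypothetical and are not taken from the FakeSTormer code.

```python
import math


def binary_cross_entropy(prob, label, eps=1e-7):
    """BCE for one predicted probability and a 0/1 label (assumed classification loss)."""
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def mean_squared_error(pred, target):
    """MSE between two equal-length flat lists (assumed auxiliary loss)."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def multi_task_loss(cls_prob, cls_label,
                    spatial_pred, spatial_gt,
                    temporal_pred, temporal_gt,
                    w_spatial=1.0, w_temporal=1.0):
    """Weighted sum of classification, spatial, and temporal losses.

    spatial_* : flattened per-pixel artifact maps (hypothetical format)
    temporal_*: per-frame artifact scores (hypothetical format)
    The weights are illustrative hyperparameters, not values from the paper.
    """
    l_cls = binary_cross_entropy(cls_prob, cls_label)
    l_spatial = mean_squared_error(spatial_pred, spatial_gt)
    l_temporal = mean_squared_error(temporal_pred, temporal_gt)
    total = l_cls + w_spatial * l_spatial + w_temporal * l_temporal
    return total, {"cls": l_cls, "spatial": l_spatial, "temporal": l_temporal}


if __name__ == "__main__":
    total, parts = multi_task_loss(
        cls_prob=0.8, cls_label=1,
        spatial_pred=[0.1, 0.9, 0.0, 0.2], spatial_gt=[0.0, 1.0, 0.0, 0.0],
        temporal_pred=[0.2, 0.7], temporal_gt=[0.0, 1.0],
    )
    print(total, parts)
```

In a setup like this, the pseudo-fake videos would supply `spatial_gt` and `temporal_gt`, since the synthesis step knows where and when artifacts were introduced; the auxiliary terms then push the shared features toward those artifact locations.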