All current benchmarks for multimodal deepfake detection manipulate entire frames using various generation techniques, and video-level classification accuracy on them has saturated at above 94%. However, these benchmarks do not cover the dynamic deepfake attacks seen in real-world scenarios, in which manipulations vary frame by frame and are harder to detect. To address this limitation, we introduce FakeMix, a novel clip-level evaluation benchmark aimed at identifying manipulated segments within both video and audio, providing insight into the origins of deepfakes. Furthermore, we propose two novel evaluation metrics, Temporal Accuracy (TA) and Frame-wise Discrimination Metric (FDM), to assess the robustness of deepfake detection models. Evaluating state-of-the-art models on diverse deepfake benchmarks, particularly FakeMix, demonstrates the effectiveness of our approach. Specifically, existing models achieve an Average Precision (AP) of 94.2% at the video level, but when evaluated at the clip level with the proposed metrics, TA and FDM, their accuracy drops sharply to 53.1% and 52.1%, respectively.