This paper presents a system for detecting fake audio-visual content (i.e., video deepfakes), developed for Track 2 of the DDL Challenge. The proposed system employs a two-stage framework comprising unimodal detection and multimodal score fusion. Specifically, it incorporates an audio deepfake detection module and an audio localization module to analyze the audio stream and pinpoint manipulated segments. In parallel, an image-based deepfake detection and localization module processes the visual modality. To effectively leverage complementary information across modalities, we further propose a multimodal score fusion strategy that integrates the outputs of the audio and visual modules. Guided by a detailed analysis of the training and evaluation datasets, we explore and evaluate several score calculation and fusion strategies to improve system robustness. The final fusion-based system achieves an AUC of 0.87, an AP of 0.55, and an AR of 0.23 on the challenge test set, yielding a final score of 0.5528.
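The abstract does not specify how the audio and visual scores are combined; a minimal sketch of one plausible fusion rule, a convex (weighted-average) combination with a hypothetical weight `w_audio`, is shown below for illustration only.

```python
# Hypothetical sketch of multimodal score fusion. The paper's actual
# fusion strategy is not given in the abstract; a simple convex
# combination of per-modality scores is assumed here.

def fuse_scores(audio_score: float, visual_score: float, w_audio: float = 0.5) -> float:
    """Combine per-modality deepfake scores (each in [0, 1]) into one score."""
    if not 0.0 <= w_audio <= 1.0:
        raise ValueError("w_audio must lie in [0, 1]")
    return w_audio * audio_score + (1.0 - w_audio) * visual_score

# Example: the audio module flags the clip strongly, the visual module is unsure.
fused = fuse_scores(audio_score=0.9, visual_score=0.4, w_audio=0.6)
print(round(fused, 2))  # 0.7
```

In practice, the weight could be tuned on a validation split, or replaced by a per-sample rule (e.g., taking the maximum of the two scores) depending on which modality is more reliable for a given manipulation type.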