In recent years, the rapid advancement of deepfake technology has posed a critical and escalating threat to public security: diffusion-based digital human generation. Unlike traditional face-manipulation methods, such models can generate highly realistic, temporally consistent videos driven by multimodal control signals. Their flexibility and covertness pose severe challenges to existing detection strategies. To bridge this gap, we introduce DigiFakeAV, a new large-scale multimodal digital human forgery dataset built on diffusion models. Leveraging five of the latest digital human generation methods and a voice-cloning method, we systematically construct a dataset of 60,000 videos (8.4 million frames) spanning multiple nationalities, skin tones, genders, and real-world scenarios, significantly enhancing data diversity and realism. User studies show that participants misclassify DigiFakeAV videos at a rate of up to 68%. Moreover, the substantial performance degradation of existing detection models on our dataset further highlights its difficulty. To address this problem, we propose DigiShield, an effective detection baseline based on spatiotemporal and cross-modal fusion. By jointly modeling the 3D spatiotemporal features of videos and the semantic-acoustic features of audio, DigiShield achieves state-of-the-art (SOTA) performance on DigiFakeAV and generalizes well to other datasets.
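As a minimal illustrative sketch (not the paper's actual DigiShield implementation), cross-modal fusion for detection can be reduced to its simplest form: normalize a per-stream video embedding and audio embedding, concatenate them, and apply a linear classification head that outputs a forgery probability. All function names, dimensions, and weights below are hypothetical placeholders.

```python
import math
import random

def l2_normalize(v):
    # Scale a vector to unit length; guard against the zero vector.
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def fuse(video_feat, audio_feat):
    # Late fusion: normalize each modality's embedding, then concatenate.
    return l2_normalize(video_feat) + l2_normalize(audio_feat)

def forgery_score(fused, weights, bias=0.0):
    # Linear head + sigmoid producing a probability that the clip is fake.
    z = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Toy example with hypothetical 4-dim video and 3-dim audio embeddings.
random.seed(0)
video_emb = [random.gauss(0, 1) for _ in range(4)]
audio_emb = [random.gauss(0, 1) for _ in range(3)]
fused = fuse(video_emb, audio_emb)
prob_fake = forgery_score(fused, weights=[0.1] * len(fused))
```

In a real system the two embeddings would come from a 3D spatiotemporal video encoder and an audio encoder, and the fusion module would typically learn cross-modal interactions (e.g., via attention) rather than simple concatenation.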