The landscape of synthetic media has been irrevocably altered by text-to-video (T2V) models, whose outputs are rapidly approaching indistinguishability from reality. Critically, this technology is no longer confined to large-scale labs: the proliferation of efficient, open-source generators has democratized the creation of high-fidelity synthetic content on consumer-grade hardware, rendering existing face-centric and manipulation-based benchmarks obsolete. To address this urgent threat, we introduce SynthForensics, to the best of our knowledge the first human-centric benchmark for detecting purely synthetic video deepfakes. The benchmark comprises 6,815 unique videos from five architecturally distinct, state-of-the-art open-source T2V models, and its construction was underpinned by a meticulous two-stage, human-in-the-loop validation process to ensure high semantic and visual quality. Each video is provided in four versions (raw, lossless, light compression, and heavy compression) to enable real-world robustness testing. Experiments demonstrate that state-of-the-art detectors are fragile and generalize poorly to this new domain: we observe a mean performance drop of $29.19\%$ AUC, with some methods performing worse than random chance and top models losing over 30 AUC points under heavy compression. We further investigate training on SynthForensics as a means to mitigate these performance gaps, achieving robust generalization to unseen generators ($93.81\%$ AUC), though at the cost of reduced backward compatibility with traditional manipulation-based deepfakes. The complete dataset and all generation metadata, including the specific prompts and inference parameters for every video, will be made publicly available at [link anonymized for review].