The rapid progress of deep speech synthesis models poses significant threats to society, such as malicious content manipulation. Many studies have therefore emerged to detect so-called deepfake audio. However, existing work focuses on binary detection of real versus fake audio. In real-world scenarios such as model copyright protection and digital evidence forensics, it is also necessary to know which tool or model generated the deepfake audio in order to explain the decision. This motivates us to ask: can we recognize the system fingerprints of deepfake audio? In this paper, we present the first deepfake audio dataset for system fingerprint recognition (SFR) and conduct an initial investigation. We collected the dataset, which includes both clean and compressed sets, from the speech synthesis systems of seven Chinese vendors that use the latest state-of-the-art deep learning technologies. In addition, to facilitate the further development of system fingerprint recognition methods, we provide extensive benchmarks for comparison together with our research findings. The dataset will be publicly available.