Today's generative neural networks allow the creation of high-quality synthetic speech at scale. While we welcome the creative use of this new technology, we must also recognize the risks. As synthetic speech is increasingly abused for fraud and identity theft, we require a broad set of deepfake identification tools. Furthermore, previous work has reported a limited ability of deep classifiers to generalize to unseen audio generators. We study the frequency-domain fingerprints of current audio generators. Building on the discovered frequency fingerprints, we train lightweight detectors that generalize well. We report improved results on the WaveFake dataset and on an extended version. To account for the rapid progress in the field, we extend the WaveFake dataset with samples drawn from the novel Avocodo and BigVGAN networks. For illustration, the supplementary material contains audio samples of generator artifacts.
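The core idea of a frequency-domain fingerprint can be sketched with a toy example: average the log-magnitude spectra of many "real" and many "generated" clips and inspect the difference. The signals below are synthetic stand-ins (white noise vs. low-pass-filtered noise mimicking the high-frequency roll-off some vocoders exhibit), not the paper's data or detection method.

```python
import numpy as np

def mean_log_spectrum(signals, n_fft=256):
    """Average log-magnitude spectrum over non-overlapping frames."""
    frames = []
    for x in signals:
        for start in range(0, len(x) - n_fft + 1, n_fft):
            frames.append(x[start:start + n_fft])
    mags = np.abs(np.fft.rfft(np.array(frames), axis=1))
    return np.log(mags + 1e-8).mean(axis=0)

rng = np.random.default_rng(0)
# Stand-in for real audio: white noise (flat spectrum).
real = [rng.standard_normal(4096) for _ in range(8)]
# Stand-in for generated audio: moving-average-filtered noise,
# an assumed proxy for a vocoder's high-frequency attenuation.
fake = [np.convolve(rng.standard_normal(4096), np.ones(4) / 4, mode="same")
        for _ in range(8)]

# The per-bin gap between the two average spectra is the "fingerprint";
# here it grows toward the high-frequency bins.
fingerprint = mean_log_spectrum(real) - mean_log_spectrum(fake)
```

In this sketch the fingerprint is a simple spectral average; a lightweight classifier operating on such frequency features is one plausible route to the generalizing detectors the abstract describes.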