Today's generative neural networks allow the creation of high-quality synthetic speech at scale. While we welcome the creative use of this new technology, we must also recognize its risks. Because synthetic speech is abused for both monetary and identity theft, we require a broad set of deepfake identification tools. Furthermore, previous work reported that deep classifiers have a limited ability to generalize to unseen audio generators. By leveraging the wavelet-packet transform and the short-time Fourier transform, we train lightweight detectors that generalize to unseen generators. We report improved results on an extension of the WaveFake dataset. To account for the rapid progress in the field, we additionally consider samples drawn from the novel Avocodo and BigVGAN networks.
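To make the two input representations named in the abstract concrete, the following is a minimal sketch of a wavelet-packet transform and a short-time Fourier transform in plain Python. It is an illustration under stated assumptions, not the authors' pipeline: the Haar wavelet, the Hann window, the frame and hop sizes, the log scaling, and all function names (`haar_wavelet_packet`, `stft_magnitude`, `log_scale`) are hypothetical choices made for this example.

```python
import cmath
import math


def haar_wavelet_packet(signal, levels):
    """Return the leaf nodes of a full Haar wavelet-packet tree.

    Unlike the plain wavelet transform, both the low-pass and the
    high-pass outputs are split again at every level.
    The Haar wavelet is an assumption made for brevity.
    """
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    nodes = [list(signal)]
    scale = 1.0 / math.sqrt(2.0)
    for _ in range(levels):
        next_nodes = []
        for node in nodes:
            pairs = list(zip(node[0::2], node[1::2]))
            next_nodes.append([(a + b) * scale for a, b in pairs])  # low-pass
            next_nodes.append([(a - b) * scale for a, b in pairs])  # high-pass
        nodes = next_nodes
    # 2**levels bands in natural filter order, which is not sorted by frequency.
    return nodes


def stft_magnitude(signal, frame_length, hop):
    """Naive STFT magnitude with a periodic Hann window (illustration only).

    Each frame costs O(frame_length**2); a real implementation would use an FFT.
    """
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / frame_length)
              for n in range(frame_length)]
    frames = []
    for start in range(0, len(signal) - frame_length + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_length)]
        spectrum = []
        for k in range(frame_length // 2 + 1):  # non-negative frequencies only
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_length)
                      for n in range(frame_length))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


def log_scale(rows, eps=1e-12):
    """Log-scale absolute values; eps (an assumed constant) avoids log(0)."""
    return [[math.log(abs(v) + eps) for v in row] for row in rows]


# Toy input: a pure tone that falls exactly on bin 4 of a 16-sample frame.
tone = [math.sin(2.0 * math.pi * 4 * n / 16) for n in range(64)]
packet_features = log_scale(haar_wavelet_packet(tone, levels=3))
stft_features = log_scale(stft_magnitude(tone, frame_length=16, hop=8))
print(len(packet_features), len(packet_features[0]))
print(len(stft_features), len(stft_features[0]))
```

Either feature map could then be fed to a small classifier as a two-dimensional input. The classifier itself, and the choice of wavelet, window, and resolution, are left open here because the abstract does not specify them.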