Recent advances in audio generation have led to an increasing number of deepfakes, making the general public more vulnerable to financial scams, identity theft, and misinformation. Audio deepfake detectors promise to alleviate this issue, with many recent studies reporting accuracy rates close to 99%. However, these methods are typically tested in an in-domain setup, where the deepfake samples in the training and test sets are produced by the same generative models. To address this limitation, we introduce XMAD-Bench, a large-scale cross-domain multilingual audio deepfake benchmark comprising 668.8 hours of real and deepfake speech. In our novel dataset, the speakers, the generative methods, and the real audio sources are distinct across the training and test splits. This leads to a challenging cross-domain evaluation setup, where audio deepfake detectors can be tested "in the wild". Our in-domain and cross-domain experiments indicate a clear disparity between the in-domain performance of deepfake detectors, which is usually as high as 100%, and the cross-domain performance of the same models, which is sometimes similar to random chance. Our benchmark highlights the need for robust audio deepfake detectors that maintain their generalization capacity across different languages, speakers, generative methods, and data sources. Our benchmark is publicly released at https://github.com/ristea/xmad-bench/.
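The cross-domain protocol described above, in which speakers, generative methods, and real audio sources are disjoint between splits, can be illustrated with a minimal sketch. Note that the sample fields ("speaker", "generator", "source") and the helper function are hypothetical and do not reflect the actual XMAD-Bench data format:

```python
# Illustrative sketch of a cross-domain partition: a sample goes to the
# training split only if its speaker, generator, and source all belong to
# the training domain, and to the test split only if none of them do.
# Field names are hypothetical, for illustration only.

def cross_domain_split(samples, train_speakers, train_generators, train_sources):
    """Partition samples so that speakers, generative methods, and real
    audio sources are disjoint between the train and test splits."""
    train, test = [], []
    for s in samples:
        in_train_domain = (
            s["speaker"] in train_speakers
            and s["generator"] in train_generators
            and s["source"] in train_sources
        )
        in_test_domain = (
            s["speaker"] not in train_speakers
            and s["generator"] not in train_generators
            and s["source"] not in train_sources
        )
        if in_train_domain:
            train.append(s)
        elif in_test_domain:
            test.append(s)
        # Samples mixing train and test domains are dropped,
        # so no speaker, generator, or source leaks across splits.
    return train, test

samples = [
    {"speaker": "A", "generator": "tts1", "source": "src1", "path": "a.wav"},
    {"speaker": "B", "generator": "tts2", "source": "src2", "path": "b.wav"},
    {"speaker": "A", "generator": "tts2", "source": "src2", "path": "c.wav"},
]
train, test = cross_domain_split(samples, {"A"}, {"tts1"}, {"src1"})
```

In this toy example, the third sample mixes a training-domain speaker with a test-domain generator, so it is discarded; this is one way to enforce the strict disjointness the benchmark requires, at the cost of dropping ambiguous samples.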