AUDETER: A Large-scale Dataset for Deepfake Audio Detection in Open Worlds

Speech synthesis systems can now produce highly realistic vocalisations that pose significant challenges for verifying authenticity. Despite substantial progress in deepfake detection models, their real-world effectiveness is often undermined by continually changing distribution shifts between training and test data, driven by the complexity of human speech and the rapid evolution of synthesis systems. Existing datasets suffer from limited real-speech diversity, insufficient coverage of recent synthesis systems, and heterogeneous mixtures of deepfake sources, all of which hinder systematic evaluation and open-world model training. To address these issues, we introduce AUDETER (AUdio DEepfake TEst Range), a large-scale and highly diverse deepfake audio dataset comprising over 4,500 hours of synthetic audio generated by 11 recent TTS models and 10 vocoders, totalling 3 million clips. We further observe that most existing detectors default to binary supervised training, which can induce negative transfer across synthesis sources when the training data contains highly diverse deepfake patterns, degrading overall generalisation. As a complementary contribution, we propose a curriculum-learning-based approach to mitigate this effect. Extensive experiments show that existing detection models struggle to generalise to the novel deepfakes and human speech in AUDETER, whereas XLR-based detectors trained on AUDETER achieve strong cross-domain performance across multiple benchmarks, including an equal error rate (EER) of 1.87% on In-the-Wild. AUDETER is available on GitHub.
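For readers unfamiliar with the EER metric reported above, the following is a minimal sketch of how an equal error rate is commonly computed from detector scores. It is an illustration only, not the evaluation code used for AUDETER: the function name `equal_error_rate`, the convention that higher scores mean "more likely fake", and the toy scores are all assumptions made for this example.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Approximate the EER of a binary detector.

    Assumed convention (hypothetical, for illustration):
    scores -- higher means "more likely fake"
    labels -- 1 for fake clips, 0 for real clips
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    real = scores[labels == 0]
    fake = scores[labels == 1]

    # Candidate thresholds: every observed score, plus +inf (nothing flagged).
    thresholds = np.concatenate([np.unique(scores), [np.inf]])

    # A clip is flagged as fake when its score is >= the threshold.
    # False positive rate: fraction of real clips flagged as fake.
    fpr = np.array([(real >= t).mean() for t in thresholds])
    # False negative rate: fraction of fake clips not flagged.
    fnr = np.array([(fake < t).mean() for t in thresholds])

    # The EER is the operating point where the two error rates meet;
    # with finitely many scores we take the closest pair and average it.
    idx = int(np.argmin(np.abs(fpr - fnr)))
    return float((fpr[idx] + fnr[idx]) / 2)


if __name__ == "__main__":
    # Toy scores: four real clips and four fake clips, one pair overlapping.
    toy_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
    toy_labels = [0, 0, 0, 1, 0, 1, 1, 1]
    print(f"EER = {equal_error_rate(toy_scores, toy_labels):.2%}")
```

A lower EER means the detector makes fewer errors at the threshold where false alarms on real speech and misses on deepfakes are equally frequent, which is why it serves as a single-number summary for cross-benchmark comparison.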
