Text-to-Speech (TTS) technology brings significant advantages, such as giving a voice to people with speech impairments, but it also enables audio deepfakes and spoofs. The former mislead individuals and may propagate misinformation, while the latter undermine voice biometric security systems. AI-based detection can help address these challenges by automatically differentiating between genuine and fabricated voice recordings. However, these models are only as good as their training data, which is currently severely limited: anti-spoofing databases concentrate overwhelmingly on English and Chinese audio, which restricts the models' worldwide effectiveness. In response, this paper presents the Multi-Language Audio Anti-Spoof Dataset (MLAAD), created using 52 TTS models, comprising 19 different architectures, to generate 160.1 hours of synthetic voice in 23 different languages. We train and evaluate three state-of-the-art deepfake detection models with MLAAD and observe that, when used as a training resource, MLAAD yields better performance than comparable datasets such as InTheWild or FakeOrReal. Furthermore, in comparison with the renowned ASVspoof 2019 dataset, MLAAD proves to be a complementary resource: in tests across eight datasets, MLAAD and ASVspoof 2019 alternately outperformed each other, each excelling on four datasets. By publishing MLAAD and making trained models accessible via an interactive webserver, we aim to democratize anti-spoofing technology, making it accessible beyond the realm of specialists and thus contributing to global efforts against audio spoofing and deepfakes.