Text-to-Speech (TTS) technology offers notable benefits, such as providing a voice for individuals with speech impairments, but it also facilitates the creation of audio deepfakes and spoofing attacks. AI-based detection methods can help mitigate these risks; however, the performance of such models is inherently dependent on the quality and diversity of their training data. Presently, the available datasets are heavily skewed towards English and Chinese audio, which limits the global applicability of these anti-spoofing systems. To address this limitation, this paper presents the Multi-Language Audio Anti-Spoof Dataset (MLAAD), created using 82 TTS models, comprising 33 different architectures, to generate 378.0 hours of synthetic voice in 38 different languages. We train and evaluate three state-of-the-art deepfake detection models with MLAAD and observe that it demonstrates superior performance over comparable datasets like InTheWild and FakeOrReal when used as a training resource. Moreover, compared to the renowned ASVspoof 2019 dataset, MLAAD proves to be a complementary resource. In tests across eight datasets, MLAAD and ASVspoof 2019 alternately outperformed each other, each excelling on four datasets. By publishing MLAAD and making a trained model accessible via an interactive web server, we aim to democratize anti-spoofing technology, making it accessible beyond the realm of specialists, and contributing to global efforts against audio spoofing and deepfakes.