Text-to-Speech (TTS) technology offers notable benefits, such as providing a voice for individuals with speech impairments, but it also facilitates the creation of audio deepfakes and spoofing attacks. AI-based detection methods can help mitigate these risks; however, the performance of such models depends on the quality and diversity of their training data. Currently available datasets are heavily skewed towards English and Chinese audio, which limits the global applicability of anti-spoofing systems. To address this limitation, this paper presents version 9 of the Multi-Language Audio Anti-Spoofing Dataset (MLAAD), which was created using 140 TTS models, comprising 78 different architectures, and contains 678.3 hours of synthetic speech in 51 languages. We train and evaluate three state-of-the-art deepfake detection models with MLAAD and observe that, as a training resource, it yields better performance than comparable datasets such as InTheWild and Fake-Or-Real. Moreover, MLAAD proves complementary to the widely used ASVspoof 2019 dataset: in tests across eight datasets, MLAAD and ASVspoof 2019 each outperformed the other on four. By publishing MLAAD and making a trained model accessible via an interactive webserver, we aim to democratize anti-spoofing technology, making it accessible beyond specialists and contributing to global efforts against audio spoofing and deepfakes.