Text-to-Speech (TTS) technology offers notable benefits, such as providing a voice for individuals with speech impairments, but it also facilitates the creation of audio deepfakes and spoofing attacks. AI-based detection methods can help mitigate these risks; however, the performance of such models is inherently dependent on the quality and diversity of their training data. Presently, the available datasets are heavily skewed towards English and Chinese audio, which limits the global applicability of these anti-spoofing systems. To address this limitation, this paper presents the Multi-Language Audio Anti-Spoof Dataset (MLAAD), created using 82 TTS models, comprising 33 different architectures, to generate 378.0 hours of synthetic voice in 38 different languages. We train and evaluate three state-of-the-art deepfake detection models with MLAAD and observe that it demonstrates superior performance over comparable datasets like InTheWild and FakeOrReal when used as a training resource. Moreover, compared to the renowned ASVspoof 2019 dataset, MLAAD proves to be a complementary resource. In tests across eight datasets, MLAAD and ASVspoof 2019 alternately outperformed each other, each excelling on four datasets. By publishing MLAAD and making a trained model accessible via an interactive web server, we aim to democratize anti-spoofing technology, making it accessible beyond the realm of specialists, and contributing to global efforts against audio spoofing and deepfakes.