MuTox: Universal MUltilingual Audio-based TOXicity Dataset and Zero-shot Detector

Marta R. Costa-jussà, Mariano Coria Meglioli, Pierre Andrews, David Dale, Prangthip Hansanti, Elahe Kalbassi, Alex Mourachko, Christophe Ropers, Carleigh Wood

Research in toxicity detection in natural language processing for the speech modality (audio-based) is quite limited, particularly for languages other than English. To address these limitations and lay the groundwork for truly multilingual audio-based toxicity detection, we introduce MuTox, the first highly multilingual audio-based dataset with toxicity labels. The dataset comprises 20,000 audio utterances for English and Spanish, and 4,000 for the other 19 languages. To demonstrate the quality of this dataset, we trained the MuTox audio-based toxicity classifier, which enables zero-shot toxicity detection across a wide range of languages. This classifier outperforms existing text-based trainable classifiers by more than 1% AUC, while expanding the language coverage more than tenfold. When compared to a wordlist-based classifier that covers a similar number of languages, MuTox improves precision and recall by approximately 2.5 times. This significant improvement underscores the potential of MuTox in advancing the field of audio-based toxicity detection.

