Generative audio models are rapidly advancing in both capability and public use: several powerful generative audio models have openly available weights, and some technology companies have released high-quality generative audio products. Yet, while prior work has enumerated many ethical issues stemming from the data on which generative visual and textual models are trained, we have little understanding of similar issues in generative audio datasets, including those related to bias, toxicity, and intellectual property. To bridge this gap, we conducted a literature review of hundreds of audio datasets and selected seven of the most prominent to audit in detail. We found that these datasets are biased against women, contain toxic stereotypes about marginalized communities, and include significant amounts of copyrighted work. To enable artists to check whether their work appears in popular audio datasets, and to facilitate exploration of the contents of these datasets, we developed a web-based audio dataset exploration tool at https://audio-audit.vercel.app.