Augmented Datasheets for Speech Datasets and Ethical Decision-Making

Speech datasets are crucial for training Speech Language Technologies (SLT); however, a lack of diversity in the underlying training data can seriously limit efforts to build equitable and robust SLT products, especially along dimensions of language, accent, dialect, variety, and speech impairment, and in the intersection of speech features with socioeconomic and demographic features. Furthermore, there is often little oversight of the ethics of collecting the underlying training data, which is commonly built from massive web crawls and/or publicly available speech. To encourage standardized documentation of such speech data components, we introduce an augmented datasheet for speech datasets, which can be used in addition to "Datasheets for Datasets". We then illustrate the importance of each question in our augmented datasheet, drawing on in-depth literature reviews of speech data used in domains such as machine learning, linguistics, and health. Finally, we encourage practitioners, ranging from dataset creators to researchers, to use our augmented datasheet to better define the scope, properties, and limits of speech datasets, while also encouraging consideration of data-subject protection and user community empowerment. Ethical dataset creation is not a one-size-fits-all process, but dataset creators can use our augmented datasheet to reflexively consider the social context of related SLT applications and data sources in order to foster more inclusive SLT products downstream.

