Speech datasets are crucial for training Speech Language Technologies (SLT); however, a lack of diversity in the underlying training data can seriously limit efforts to build equitable and robust SLT products, especially along dimensions of language, accent, dialect, variety, and speech impairment, as well as the intersectionality of speech features with socioeconomic and demographic features. Furthermore, the underlying training data, commonly built from massive web crawls and/or publicly available speech, often receives little oversight with regard to the ethics of its collection. To encourage standardized documentation of such speech data components, we introduce an augmented datasheet for speech datasets, which can be used in addition to "Datasheets for Datasets". We then exemplify the importance of each question in our augmented datasheet based on in-depth literature reviews of speech data used in domains such as machine learning, linguistics, and health. Finally, we encourage practitioners, ranging from dataset creators to researchers, to use our augmented datasheet to better define the scope, properties, and limits of speech datasets, while also encouraging consideration of data-subject protection and user community empowerment. Ethical dataset creation is not a one-size-fits-all process, but dataset creators can use our augmented datasheet to reflexively consider the social context of related SLT applications and data sources in order to foster more inclusive SLT products downstream.