Considerations for Ethical Speech Recognition Datasets

Speech AI technologies are largely trained on publicly available datasets or on speech collected through massive web crawling. In both cases, data acquisition focuses on minimizing collection effort, without necessarily taking the data subjects' protection or user needs into consideration. This results in models that are not robust for users who deviate from the dominant demographics in the training set, discriminating against individuals with different dialects, accents, speaking styles, and disfluencies. In this talk, we use automatic speech recognition as a case study and examine the properties that ethical speech datasets should possess to support responsible AI applications. We showcase diversity issues, inclusion practices, and necessary considerations that can improve trained models while facilitating model explainability and protecting users and data subjects. We argue for the legal and privacy protection of data subjects, targeted data sampling corresponding to user demographics and needs, appropriate metadata that ensures explainability and accountability in cases of model failure, and sociotechnical and situated model design. We hope this talk can inspire researchers and practitioners to design and use more human-centric datasets in speech technologies and other domains, in ways that empower and respect users while improving machine learning models' robustness and utility.

