Current public datasets for automatic speech recognition (ASR) rarely focus on fairness, such as performance across different demographic groups. This paper introduces Fair-Speech, a publicly released corpus that helps researchers evaluate the accuracy of their ASR models across a diverse set of self-reported demographic attributes, such as age, gender, ethnicity, geographic variation, and whether participants consider themselves native English speakers. The dataset includes approximately 26.5K utterances of recorded speech from 593 people in the United States, who were paid to record and submit audio of themselves saying voice commands. We also provide ASR baselines, including for models trained on transcribed and untranscribed social media videos and for open-source models.