Reexamining Racial Disparities in Automatic Speech Recognition Performance: The Role of Confounding by Provenance

Automatic speech recognition (ASR) models trained on large amounts of audio data are now widely used to convert speech to written text in applications ranging from video captioning to automated assistants in healthcare and other domains. It is therefore important that ASR models and their use are fair and equitable. Prior work examining the performance of commercial ASR systems on the Corpus of Regional African American Language (CORAAL) demonstrated significantly worse ASR performance on African American English (AAE). The current study seeks to understand the factors underlying this disparity by examining the performance of the current state-of-the-art neural-network-based ASR system (Whisper, OpenAI) on the CORAAL dataset. The study yields two key findings. The first confirms prior findings of significant dialectal variation even across neighboring communities, and of worse ASR performance on AAE that can be improved to some extent by fine-tuning ASR models. The second is a novel finding not discussed in prior work on CORAAL: differences in audio recording practices within the dataset have a significant impact on ASR accuracy, resulting in a "confounding by provenance" effect in which both language use and recording quality differ by study location. These findings highlight the need for further systematic investigation to disentangle the effects of recording quality and inherent linguistic diversity when examining fairness and bias in neural ASR models, as any bias in ASR accuracy may have negative downstream effects on disparities in the many domains of life in which ASR technology is used.
