The growing use of automatic speech recognition (ASR) systems warrants robust auditing approaches to ensure equitable transcription quality, especially for people with speech disorders such as aphasia, who disproportionately depend on ASR. While academic and industry audits have revealed performance disparities across user populations, standard auditing practices often overlook nuances, which risks masking harm to marginalized groups. We identify three common pitfalls in standard ASR audits: (1) adhering to a single method of text standardization, which can mask variance in ASR performance and ignore the standardization preferences of marginalized communities; (2) reporting high-level demographic findings without examining performance disparities across intersectional subgroups or conditioning on relevant acoustic properties; and (3) reporting only a single gold-standard metric, word error rate (WER), which inadequately quantifies errors common to generative AI, such as hallucinations. We propose a holistic auditing framework that addresses these pitfalls and, in a case study of six popular ASR systems, find consistently worse ASR performance for speakers with aphasia relative to a control group. We call on practitioners to implement these robust, community-driven ASR auditing practices, which are better suited to the rapidly changing ASR landscape.
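To make the first pitfall concrete, the following minimal sketch shows how the choice of text standardization alone can change the word error rate for the same transcript pair. It is an illustration under stated assumptions, not the framework evaluated in the case study: the two normalizers, the filler-word list, and the example utterance are all hypothetical, and WER is computed here as word-level edit distance divided by the number of reference words.

```python
import string

# Hypothetical filler list, chosen only for illustration.
FILLERS = {"um", "uh"}


def normalize_minimal(text: str) -> str:
    """Lowercase and strip ASCII punctuation; keep every spoken word."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def normalize_no_fillers(text: str) -> str:
    """Same as normalize_minimal, but also drop filler words."""
    words = normalize_minimal(text).split()
    return " ".join(w for w in words if w not in FILLERS)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, substitution)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical verbatim reference and ASR output for one utterance.
reference = "Um, I I want the... the water."
hypothesis = "I want the water."

for name, norm in [("minimal", normalize_minimal),
                   ("no fillers", normalize_no_fillers)]:
    wer = word_error_rate(norm(reference), norm(hypothesis))
    print(f"{name}: WER = {wer:.3f}")
```

In this toy case the score depends on whether the filler word counts as part of the reference, which is why an audit that fixes one standardization scheme can hide or exaggerate differences between groups whose speech contains more disfluencies.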