In this work, we address the challenge of building fair English ASR systems for second-language speakers. Our analysis of two widely used ASR models, Whisper and SeamlessM4T, reveals large fluctuations in word error rate (WER) across 26 accent groups, indicating significant fairness gaps. To mitigate this, we propose fairness-prompted finetuning with lightweight adapters, incorporating Spectral Decoupling (SD), Group Distributionally Robust Optimization (Group-DRO), and Invariant Risk Minimization (IRM). Our proposed fusion of standard empirical risk minimization (ERM) with cross-entropy loss and fairness-driven objectives (SD, Group-DRO, and IRM) enhances fairness across accent groups while maintaining overall recognition accuracy. In terms of macro-averaged WER, our approach achieves relative improvements of 58.7% and 58.5% over the large pretrained Whisper and SeamlessM4T models, respectively, and of 9.7% and 7.8% over the same models finetuned with standard ERM using cross-entropy loss.
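The fused objective described above can be sketched as follows. This is a toy illustration under our own assumptions, not the paper's implementation: the names `bce` and `fused_loss` and the hyperparameters `lam_sd` and `dro_eta` are hypothetical, the IRM penalty is omitted for brevity, and real ASR training would apply sequence-level cross-entropy over token logits rather than this scalar binary form.

```python
import math

def bce(logit, y):
    # Binary cross-entropy from a raw logit (scalar toy stand-in for
    # the token-level cross-entropy used in ASR finetuning).
    p = 1.0 / (1.0 + math.exp(-logit))
    p = min(max(p, 1e-12), 1.0 - 1e-12)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def fused_loss(group_logits, group_labels, lam_sd=0.1, dro_eta=1.0):
    """Fuse ERM cross-entropy with an SD penalty and a Group-DRO-style
    reweighting across accent groups (hypothetical hyperparameters).

    group_logits / group_labels: dict accent_group -> list of logits / labels.
    """
    group_losses = {}
    for g in group_logits:
        n = len(group_logits[g])
        # ERM term: mean cross-entropy within the group.
        ce = sum(bce(z, y) for z, y in zip(group_logits[g], group_labels[g])) / n
        # Spectral Decoupling: penalize logit magnitude to discourage
        # over-reliance on spurious, strongly predictive features.
        sd = lam_sd * sum(z * z for z in group_logits[g]) / n
        group_losses[g] = ce + sd
    # Group-DRO: softmax weights upweight the worst (highest-loss) group,
    # so the optimizer focuses on the accent group it serves worst.
    zmax = max(group_losses.values())
    w = {g: math.exp(dro_eta * (l - zmax)) for g, l in group_losses.items()}
    total_w = sum(w.values())
    return sum(w[g] / total_w * group_losses[g] for g in group_losses)
```

Because the Group-DRO weights are monotone in the per-group losses, the fused value always sits at or above the uniform average of group losses, which is what pushes the model toward the hardest accent group.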