In this work, we address the challenge of building fair English ASR systems for second-language speakers. Our analysis of two widely used ASR models, Whisper and SeamlessM4T, reveals large fluctuations in word error rate (WER) across 26 accent groups, indicating significant fairness gaps. To mitigate this, we propose fairness-prompted finetuning with lightweight adapters, incorporating Spectral Decoupling (SD), Group Distributionally Robust Optimization (Group-DRO), and Invariant Risk Minimization (IRM). Our proposed fusion of traditional empirical risk minimization (ERM) with cross-entropy loss and fairness-driven objectives (SD, Group-DRO, and IRM) improves fairness across accent groups while maintaining overall recognition accuracy. In terms of macro-averaged WER, our approach achieves relative improvements of 58.7% and 58.5% over the large pretrained Whisper and SeamlessM4T models, respectively, and of 9.7% and 7.8% over the same models finetuned with standard ERM using cross-entropy loss.
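As an illustrative sketch only (not the paper's implementation, whose exact weighting and adapter details are not given here), the fusion of ERM cross-entropy with the three fairness objectives can be written as a single scalar loss: Group-DRO reweights per-accent-group risks toward the worst groups, SD adds a penalty on logit magnitudes, and the IRM penalty (estimated here by finite differences, IRMv1-style) discourages group-specific classifier scaling. All function names and coefficients (`fused_objective`, `lam_sd`, `lam_irm`, `eta`) are hypothetical.

```python
import numpy as np

def cross_entropy(logits, labels):
    # Mean softmax cross-entropy over a batch of (n, C) logits.
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

def irm_penalty(logits, labels, eps=1e-4):
    # Squared gradient of the risk w.r.t. a scalar classifier w = 1.0,
    # estimated by central finite differences.
    lp = cross_entropy((1.0 + eps) * logits, labels)
    lm = cross_entropy((1.0 - eps) * logits, labels)
    return ((lp - lm) / (2 * eps)) ** 2

def fused_objective(group_logits, group_labels, q, eta=0.1,
                    lam_sd=0.0, lam_irm=0.0):
    """One step of a fused objective: Group-DRO-weighted ERM risk
    plus SD and IRM penalties. Returns (loss, updated group weights q)."""
    risks = np.array([cross_entropy(x, y)
                      for x, y in zip(group_logits, group_labels)])
    # Group-DRO: exponentiated-gradient update upweights worst groups.
    q = q * np.exp(eta * risks)
    q = q / q.sum()
    loss = float(q @ risks)
    n_groups = len(group_logits)
    for x, y in zip(group_logits, group_labels):
        loss += lam_sd * 0.5 * np.mean(x ** 2) / n_groups   # Spectral Decoupling
        loss += lam_irm * irm_penalty(x, y) / n_groups      # IRM penalty
    return loss, q
```

With `eta = 0` and both penalty weights at zero, the objective reduces to the macro-averaged (uniformly group-weighted) ERM risk; raising `eta` shifts weight toward high-WER accent groups, which is the mechanism behind the fairness gains described above.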