Towards Fair ASR For Second Language Speakers Using Fairness Prompted Finetuning

In this work, we address the challenge of building fair English ASR systems for second-language speakers. Our analysis of two widely used ASR models, Whisper and SeamlessM4T, reveals large fluctuations in word error rate (WER) across 26 accent groups, indicating significant fairness gaps. To mitigate this, we propose fairness-prompted finetuning with lightweight adapters, incorporating Spectral Decoupling (SD), Group Distributionally Robust Optimization (Group-DRO), and Invariant Risk Minimization (IRM). Our proposed fusion of traditional empirical risk minimization (ERM) using cross-entropy with fairness-driven objectives (SD, Group-DRO, and IRM) enhances fairness across accent groups while maintaining overall recognition accuracy. In terms of macro-averaged WER, our approach achieves relative improvements of 58.7% and 58.5% over the large pretrained Whisper and SeamlessM4T models, respectively, and of 9.7% and 7.8% over the same models finetuned with standard ERM using cross-entropy loss.
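
To make the fused objective concrete, the sketch below combines a cross-entropy ERM term with SD, Group-DRO, and IRM terms on toy per-token logits. It is only an illustration under stated assumptions: the abstract does not say how the terms are combined, so the weighted sum, the weights `lam_sd`, `lam_dro`, `lam_irm`, the step size `eta`, and all function names here are hypothetical, and the adapters and ASR model are omitted entirely. Accent groups play the role of both DRO groups and IRM environments.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    return -math.log(softmax(logits)[label])

def sd_penalty(logits):
    # Spectral Decoupling: an L2 penalty on the logits themselves.
    return 0.5 * sum(z * z for z in logits)

def irm_grad(logits, label):
    # IRMv1 uses the gradient of the risk w.r.t. a dummy scale w at w = 1.
    # For cross-entropy: d/dw CE(w * z, y) = sum_k (p_k - 1[k == y]) * z_k.
    p = softmax(logits)
    return sum((p[k] - (1.0 if k == label else 0.0)) * z
               for k, z in enumerate(logits))

def fused_loss(batch, q, lam_sd=0.1, lam_dro=1.0, lam_irm=1.0, eta=0.5):
    """batch: list of (logits, label, group); q: dict of group -> DRO weight.

    Returns (total_loss, updated_q). All weights are assumed values.
    """
    groups = {}
    for logits, label, group in batch:
        groups.setdefault(group, []).append((logits, label))

    # ERM and SD terms are averaged over the whole batch.
    erm = sum(cross_entropy(z, y) for z, y, _ in batch) / len(batch)
    sd = sum(sd_penalty(z) for z, _, _ in batch) / len(batch)

    # Group-DRO: exponentiated-gradient step that upweights high-loss groups.
    group_loss = {g: sum(cross_entropy(z, y) for z, y in ex) / len(ex)
                  for g, ex in groups.items()}
    new_q = dict(q)
    for g, loss in group_loss.items():
        new_q[g] = q[g] * math.exp(eta * loss)
    norm = sum(new_q.values())
    new_q = {g: w / norm for g, w in new_q.items()}
    dro = sum(new_q[g] * group_loss[g] for g in group_loss)

    # IRM: squared per-group gradient w.r.t. the dummy scale, averaged.
    irm = sum((sum(irm_grad(z, y) for z, y in ex) / len(ex)) ** 2
              for ex in groups.values()) / len(groups)

    return erm + lam_sd * sd + lam_dro * dro + lam_irm * irm, new_q

# Toy usage: group "b" is predicted badly, so its DRO weight grows.
toy_batch = [([4.0, 0.0], 0, "a"), ([4.0, 0.0], 1, "b")]
loss, weights = fused_loss(toy_batch, {"a": 0.5, "b": 0.5})
print(loss, weights)
```

In an actual finetuning setup the same terms would be computed with an autograd framework so that gradients flow into the adapter parameters; the closed-form IRM gradient above is used only to keep the sketch dependency-free.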
