Modern deep learning models often achieve high overall performance, but consistently fail on specific subgroups. Group distributionally robust optimization (group DRO) addresses this problem by minimizing the worst-group loss, but it fails when group losses misrepresent performance differences between groups. This is common in domains like speech, where the widely used connectionist temporal classification (CTC) loss not only scales with input length but also varies with linguistic and acoustic properties, leading to spurious differences between group losses. We present CTC-DRO, which addresses the shortcomings of the group DRO objective by smoothing the group weight update to prevent overemphasis on consistently high-loss groups, while using input length-matched batching to mitigate CTC's scaling issues. We evaluate CTC-DRO on the task of multilingual automatic speech recognition (ASR) across five language sets from the diverse ML-SUPERB 2.0 benchmark. CTC-DRO consistently outperforms group DRO and CTC-based baseline models, reducing the worst-language error by up to 47.1% and the average error by up to 32.9%. CTC-DRO can be applied to ASR with minimal computational costs, and, while motivated by multilingual ASR, offers the potential for reducing group disparities in other domains with similar challenges.
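To make the core idea concrete, here is a minimal sketch of an exponentiated-gradient group-weight update of the kind used in group DRO, alongside an illustrative smoothed variant in the spirit of CTC-DRO. The function names, the smoothing constant `alpha`, and the specific smoothing form (dividing each group's loss by its current weight plus `alpha`) are assumptions for illustration, not the paper's exact update rule.

```python
import numpy as np

def group_dro_update(weights, losses, eta=0.01):
    """Standard group DRO: exponentiated-gradient step on group weights.

    Groups with higher loss receive exponentially more weight, which can
    over-emphasize a group whose loss is persistently high for spurious
    reasons (e.g., CTC loss scaling with input length).
    """
    w = weights * np.exp(eta * losses)
    return w / w.sum()  # renormalize to a distribution over groups

def smoothed_group_dro_update(weights, losses, eta=0.01, alpha=1.0):
    """Illustrative smoothed variant (hypothetical form, not the exact
    CTC-DRO rule): scaling each group's loss by the inverse of its
    current weight plus a smoothing constant dampens the update for
    groups that already carry large weight, preventing runaway emphasis
    on consistently high-loss groups.
    """
    w = weights * np.exp(eta * losses / (weights + alpha))
    return w / w.sum()
```

With uniform initial weights and one group at much higher loss, the smoothed variant shifts less probability mass onto that group than the standard update does, which is the qualitative behavior the abstract describes.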