Current automatic speech recognition (ASR) models are designed to be used across many languages and tasks without substantial changes. However, this broad language coverage hides performance gaps within languages, for example, across genders. We systematically evaluate two widely used multilingual ASR models on three datasets, encompassing 19 languages from eight language families and two speaking conditions. Our findings reveal clear gender disparities, with the advantaged group varying across languages and models. Surprisingly, these gaps are not explained by acoustic or lexical properties. However, probing internal model states reveals a correlation with the gendered performance gap: the easier it is to distinguish speaker gender in a language using probes, the more the gap narrows, favoring female speakers. Our results show that gender disparities persist even in state-of-the-art models. These findings have implications for improving multilingual ASR systems, underscoring the importance of access to training data and of nuanced evaluation for predicting and mitigating gender gaps. We release all code and artifacts at https://github.com/g8a9/multilingual-asr-gender-gap.