Modern automatic speech recognition (ASR) systems have been observed to perform better for some speaker groups (SGs) than for others, despite recent gains in overall performance. One potential impediment to progress towards fairer ASR is the lack of a nuanced understanding of the types of modeling errors that speech encoder models make, and in particular of how the structure of embeddings differs between high-performance and low-performance SGs. This paper proposes a framework that distinguishes two types of error that can occur when modeling phonemes in ASR systems: random error, i.e. high variance in phoneme embeddings, and systematic error, i.e. embedding bias. We find that training phoneme classification probes on only a single, typically disadvantaged SG sometimes improves performance for that SG, which is evidence for the existence of SG-level bias in phoneme embeddings. On the other hand, we find that the speakers and SGs with higher levels of phoneme variance are also those with worse phoneme prediction accuracy. We conclude that both types of error are present in phoneme embeddings and that both are candidate causes of SG-level unfairness in ASR, though random error is likely a greater hindrance to fairness than systematic error. Furthermore, we find that finetuning encoder models using a fairness-enhancing algorithm (domain enhancing and adversarial training) changes neither the benefits of in-domain phoneme classification probe training nor the measured levels of random embedding error.
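The distinction between systematic error (embedding bias) and random error (embedding variance) can be illustrated with a small numerical sketch. This is not the paper's measurement procedure: it assumes, purely for illustration, that bias is summarised as the distance between a group's phoneme centroid and a reference group's centroid, and that random error is summarised as the mean squared distance of a group's embeddings from its own centroid. The function name `phoneme_error_profile`, the group labels, and the synthetic data are all hypothetical.

```python
import numpy as np


def phoneme_error_profile(emb, phonemes, groups, reference_group):
    """Summarise, per (group, phoneme), two illustrative error measures.

    bias:     Euclidean distance between the group's phoneme centroid and the
              reference group's centroid (an assumed proxy for systematic error).
    variance: mean squared distance of the group's embeddings from their own
              centroid (an assumed proxy for random error).
    Assumes every group has at least one sample of every phoneme.
    """
    profile = {}
    for p in np.unique(phonemes):
        ref_centroid = emb[(phonemes == p) & (groups == reference_group)].mean(axis=0)
        for g in np.unique(groups):
            x = emb[(phonemes == p) & (groups == g)]
            centroid = x.mean(axis=0)
            bias = float(np.linalg.norm(centroid - ref_centroid))
            variance = float(((x - centroid) ** 2).sum(axis=1).mean())
            profile[(str(g), str(p))] = {"bias": bias, "variance": variance}
    return profile


# Synthetic stand-ins for encoder embeddings of one phoneme (hypothetical data).
rng = np.random.default_rng(0)
n, dim = 500, 8
emb = np.concatenate([
    rng.normal(0.0, 1.0, size=(n, dim)),  # group A: reference
    rng.normal(2.0, 1.0, size=(n, dim)),  # group B: shifted centroid (bias)
    rng.normal(0.0, 3.0, size=(n, dim)),  # group C: same centroid, wider spread
])
groups = np.repeat(np.array(["A", "B", "C"]), n)
phonemes = np.full(3 * n, "aa")

profile = phoneme_error_profile(emb, phonemes, groups, reference_group="A")
for key, stats in profile.items():
    print(key, stats)
```

In this toy setup group B differs from the reference mainly in its bias term and group C mainly in its variance term. A probe trained only on group B could in principle compensate for the centroid shift, whereas no retraining of a probe removes the extra spread in group C, which is one way to read the abstract's claim that random error may be the harder obstacle.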