The widespread adoption of automatic speech recognition (ASR) systems has prompted growing scrutiny of demographic biases related to race, age, gender, and accent, which often stem from imbalanced training data. Most of these studies have focused on standard grapheme-based ASR systems, with comparatively little attention paid to phoneme-based systems, such as models that produce International Phonetic Alphabet (IPA) transcriptions. As ASR systems shift toward multilingual support and low-resource language modeling, IPA-based layers serve as a critical, language-agnostic foundation. In this study, we evaluate two state-of-the-art open-source ASR systems that generate IPA transcriptions, WhisperIPA and ZIPA, across diverse accents and languages. Our evaluation covers existing multilingual speech corpora and demographically annotated English-language corpora. We measure model performance by comparing model-generated IPA transcriptions against the output of grapheme-to-phoneme (G2P) systems, using both the standard phoneme error rate (PER) and a proposed Soft PER metric that tolerates linguistically similar phoneme substitutions. Our analysis examines how performance varies across languages and across demographic groups defined by gender, accent, ethnicity, and age, and it reveals disparities that persist even after accounting for acceptable phonemic variation. These findings provide insight into potential sources of bias and inform the development of more inclusive and linguistically robust phoneme-based ASR systems. Our code and data will be made publicly available to the community.
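To make the distinction between the two metrics concrete, the following is a minimal sketch and not the paper's implementation. It assumes that PER is the Levenshtein distance between reference and hypothesis phoneme sequences divided by the reference length, and that Soft PER differs only in charging nothing for a substitution between phonemes in the same similarity group. The groups in `SIMILAR_GROUPS`, the zero substitution cost, and the function names are hypothetical choices made for illustration; the abstract does not specify them.

```python
from typing import Callable, Sequence

# Hypothetical similarity groups, for illustration only.
# These are NOT the groupings used in the paper.
SIMILAR_GROUPS = [
    {"ɹ", "r", "ɾ"},
    {"ɛ", "e"},
    {"ɪ", "i"},
]


def hard_cost(ref: str, hyp: str) -> float:
    """Standard substitution cost: any mismatch counts as one error."""
    return 0.0 if ref == hyp else 1.0


def soft_cost(ref: str, hyp: str) -> float:
    """Assumed soft cost: mismatches within one similarity group are free."""
    if ref == hyp:
        return 0.0
    if any(ref in group and hyp in group for group in SIMILAR_GROUPS):
        return 0.0
    return 1.0


def error_rate(
    ref: Sequence[str],
    hyp: Sequence[str],
    sub_cost: Callable[[str, str], float],
) -> float:
    """Weighted edit distance between phoneme sequences, divided by len(ref)."""
    if not ref:
        raise ValueError("reference sequence must be non-empty")
    # prev[j] holds the cost of aligning the first i-1 reference phonemes
    # with the first j hypothesis phonemes.
    prev = [float(j) for j in range(len(hyp) + 1)]
    for i, r in enumerate(ref, start=1):
        curr = [float(i)]
        for j, h in enumerate(hyp, start=1):
            curr.append(
                min(
                    prev[j] + 1.0,                 # deletion of r
                    curr[j - 1] + 1.0,             # insertion of h
                    prev[j - 1] + sub_cost(r, h),  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


reference = ["h", "ɛ", "l", "oʊ"]   # e.g. G2P output
hypothesis = ["h", "e", "l", "o"]   # e.g. model output
per = error_rate(reference, hypothesis, hard_cost)
soft_per = error_rate(reference, hypothesis, soft_cost)
```

With these illustrative groups, `per` is 0.5 (two substitutions over four reference phonemes) while `soft_per` is 0.25, because the ɛ/e substitution is tolerated and the oʊ/o substitution is not. A graded cost between 0 and 1, for instance one derived from articulatory feature distance, would be an equally plausible reading of "tolerates".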