Modern self-supervised speech models are pretrained on large-scale audio data to encode speech into contextualized representations. However, their training data are heavily skewed toward high-resource languages and contain little data from low-resource languages, raising concerns that typologically uncommon speech sounds may be underrepresented. One such class is click consonants, which are found primarily in Khoisan languages. This leads to our central research question: can these models recognize click consonants as accurately as other speech sounds? To address this question, we fine-tune and compare two pretrained self-supervised speech models (Wav2Vec2 and HuBERT) on data from two click-rich Khoisan languages (G|ui and West !Xoon). Our results show that the fine-tuned models consistently recognize clicks more accurately than non-clicks, suggesting that self-supervision enables generalization across human speech sounds, including rare phonemes.