The performance of automatic speech recognition (ASR) systems has advanced substantially in recent years, particularly for languages for which a large amount of transcribed speech is available. For low-resource languages, such as minority languages, regional languages or dialects, ASR performance generally remains much lower. In this study, we investigate whether data augmentation techniques can help improve low-resource ASR performance, focusing on four typologically diverse minority languages or language variants (West Germanic: Gronings, West-Frisian; Malayo-Polynesian: Besemah, Nasal). For all four languages, we examine the use of self-training, where an ASR system trained on the available human-transcribed data is used to generate transcriptions for untranscribed speech, which are then combined with the original data to train a new ASR system. For Gronings, for which a text-to-speech (TTS) system was already available, we also examine the use of TTS to generate ASR training data from text-only sources. We find that self-training consistently yields improved performance (a relative word error rate (WER) reduction of up to 20.5% compared to an ASR system trained on 24 minutes of manually transcribed speech). The performance gain from TTS augmentation for Gronings is even larger (a relative WER reduction of up to 25.5% compared to a system trained on 24 minutes of manually transcribed speech). In sum, our results show that self-training or, where a TTS system exists, TTS-generated data is an efficient way to overcome limited data availability and improve ASR performance for resource-scarce languages.
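As a rough illustration of the two ideas in the abstract, the sketch below shows how a relative WER reduction is computed and what one round of self-training looks like in outline. It is not the authors' pipeline: the `train` and `transcribe` callables, the `Pair` type, and the example WER values in the usage note are hypothetical placeholders, and details the abstract does not specify (such as any filtering of generated transcriptions) are left out.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Model = TypeVar("Model")
Pair = Tuple[str, str]  # hypothetical (audio identifier, transcript) pair


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                         # deletion
                curr[j - 1] + 1,                     # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer (0.205 means 20.5%)."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


def self_train(
    labeled: Sequence[Pair],
    unlabeled_audio: Sequence[str],
    train: Callable[[List[Pair]], Model],
    transcribe: Callable[[Model, str], str],
) -> Model:
    """One round of self-training with caller-supplied (hypothetical) functions."""
    # 1. Train a seed system on the human-transcribed data only.
    seed_model = train(list(labeled))
    # 2. Use the seed system to transcribe the untranscribed audio.
    pseudo_labeled = [(audio, transcribe(seed_model, audio)) for audio in unlabeled_audio]
    # 3. Train a new system on the original plus machine-transcribed data.
    return train(list(labeled) + pseudo_labeled)
```

For example, with made-up values, a baseline WER of 0.40 that drops to 0.318 after augmentation corresponds to `relative_wer_reduction(0.40, 0.318)`, a relative reduction of about 20.5%. The absolute WER values here are illustrative only and are not taken from the study.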