Making More of Little Data: Improving Low-Resource Automatic Speech Recognition Using Data Augmentation

The performance of automatic speech recognition (ASR) systems has advanced substantially in recent years, particularly for languages with large amounts of transcribed speech. For low-resource languages, such as minority languages, regional languages or dialects, ASR performance generally remains much lower. In this study, we investigate whether data augmentation techniques can improve low-resource ASR performance, focusing on four typologically diverse minority languages or language variants (West Germanic: Gronings, West-Frisian; Malayo-Polynesian: Besemah, Nasal). For all four languages, we examine the use of self-training, where an ASR system trained with the available human-transcribed data is used to generate transcriptions, which are then combined with the original data to train a new ASR system. For Gronings, for which a pre-existing text-to-speech (TTS) system was available, we also examine the use of TTS to generate ASR training data from text-only sources. We find that self-training consistently yields improved performance (a relative word error rate (WER) reduction of up to 20.5% compared to an ASR system trained on 24 minutes of manually transcribed speech). The performance gain from TTS augmentation for Gronings was even stronger (a relative WER reduction of up to 25.5% compared to a system based on 24 minutes of manually transcribed speech). In sum, our results show the benefit of using self-training or (if possible) TTS-generated data as an efficient way to overcome limited data availability for resource-scarce languages and thereby improve ASR performance.
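The relative WER reductions quoted above can be made concrete with a minimal sketch, assuming the standard definition of WER as word-level edit distance divided by the number of reference words. The function names `word_error_rate` and `relative_wer_reduction` and the numbers in the usage note are illustrative assumptions, not the evaluation code or results of this study.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer
```

As a purely hypothetical illustration, a baseline WER of 0.40 that drops to 0.30 after augmentation gives `relative_wer_reduction(0.40, 0.30)`, a relative reduction of about 25%. The absolute WER values here are invented for the arithmetic only.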
