Making More of Little Data: Improving Low-Resource Automatic Speech Recognition Using Data Augmentation

The performance of automatic speech recognition (ASR) systems has advanced substantially in recent years, particularly for languages for which a large amount of transcribed speech is available. For low-resource languages, such as minority languages, regional languages or dialects, ASR performance generally remains much lower. In this study, we investigate whether data augmentation techniques can improve low-resource ASR performance, focusing on four typologically diverse minority languages or language variants (West Germanic: Gronings, West-Frisian; Malayo-Polynesian: Besemah, Nasal). For all four languages, we examine the use of self-training, where an ASR system trained with the available human-transcribed data is used to generate transcriptions, which are then combined with the original data to train a new ASR system. For Gronings, for which a text-to-speech (TTS) system was already available, we also examine the use of TTS to generate ASR training data from text-only sources. We find that self-training consistently yields improved performance (a relative reduction in word error rate (WER) of up to 20.5% compared to an ASR system trained on 24 minutes of manually transcribed speech). The performance gain from TTS augmentation for Gronings is even stronger (a relative WER reduction of up to 25.5% compared to a system trained on 24 minutes of manually transcribed speech). In sum, our results show the benefit of using self-training or (if possible) TTS-generated data as an efficient way to overcome the limited availability of data for resource-scarce languages and thereby improve ASR performance.
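
To make the self-training procedure and the reported metric concrete, the following is a minimal sketch and not the system used in this study. It assumes that the machine-generated transcriptions are produced for additional untranscribed audio, and it leaves the actual ASR training to a caller-supplied function. The names `self_train`, `train`, `word_error_rate` and `relative_wer_reduction` are hypothetical and chosen for illustration only.

```python
from typing import Callable, List, Tuple

# (audio identifier, transcription); both are plain strings in this sketch.
Utterance = Tuple[str, str]
# A "model" here is simply a function from audio identifier to transcription.
Model = Callable[[str], str]


def self_train(
    labeled: List[Utterance],
    unlabeled_audio: List[str],
    train: Callable[[List[Utterance]], Model],
) -> Model:
    """One round of self-training; `train` is a hypothetical, caller-supplied trainer."""
    # 1. Train a first system on the human-transcribed data only.
    teacher = train(labeled)
    # 2. Let that system transcribe the untranscribed audio.
    pseudo_labeled = [(audio, teacher(audio)) for audio in unlabeled_audio]
    # 3. Train a new system on the original plus machine-transcribed data.
    return train(labeled + pseudo_labeled)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(
                min(
                    prev[j] + 1,  # deletion
                    curr[j - 1] + 1,  # insertion
                    prev[j - 1] + (ref_word != hyp_word),  # substitution or match
                )
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent, measured against the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer
```

As a purely illustrative calculation with made-up numbers, lowering a baseline WER of 0.50 to 0.40 corresponds to a relative reduction of 20%. The sketch performs a single round and keeps every machine-generated transcription; whether to filter them or to iterate is a design choice it does not address.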
