Despite the recent availability of large transcribed Kinyarwanda speech datasets, robust speech recognition for Kinyarwanda remains challenging. In this work, we show that three techniques significantly improve speech recognition performance for Kinyarwanda: self-supervised pre-training, a simple curriculum schedule during fine-tuning, and semi-supervised learning that leverages large unlabelled speech data. Our approach uses public domain data only. We collect a new studio-quality speech dataset from a public website and use it to train a clean baseline model. The clean baseline model is then used to rank examples from a more diverse and noisy public dataset, which defines a simple curriculum training schedule. Finally, we apply semi-supervised learning to label and learn from large unlabelled data over four successive generations. Our final model achieves a 3.2% word error rate (WER) on the new dataset and a 15.9% WER on the Mozilla Common Voice benchmark, which is state-of-the-art to the best of our knowledge. Our experiments also indicate that syllabic rather than character-based tokenization yields better speech recognition performance for Kinyarwanda.
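As an illustration of the metric and of the ranking step, the following minimal sketch computes word error rate and orders utterances from easy to hard. It rests on assumptions that the abstract does not confirm: that examples are ranked by the per-utterance WER of the clean baseline model's transcripts, and that the ranked list is cut into equal-sized stages. The names `wer`, `curriculum_stages`, and `num_stages` are hypothetical and chosen for this example only.

```python
from typing import List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        # Convention chosen for this sketch: an empty reference scores 0.0
        # only when the hypothesis is empty as well.
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def curriculum_stages(
    examples: List[Tuple[str, str, str]], num_stages: int
) -> List[List[str]]:
    """Rank utterances by baseline WER and split them into training stages.

    Each example is (utterance_id, reference_text, baseline_hypothesis).
    ASSUMPTION: ranking by per-utterance WER and equal-sized stages are
    illustrative choices, not the schedule reported by the authors.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    # sorted() is stable, so ties keep their original order.
    ranked = sorted(examples, key=lambda ex: wer(ex[1], ex[2]))
    ids = [ex[0] for ex in ranked]
    if not ids:
        return []
    size = -(-len(ids) // num_stages)  # ceiling division
    return [ids[i:i + size] for i in range(0, len(ids), size)]


if __name__ == "__main__":
    # Placeholder tokens stand in for real transcripts.
    data = [
        ("u1", "a b c d", "a x c"),
        ("u2", "a b", "a b"),
        ("u3", "a b", "x y"),
        ("u4", "a b c d", "a b c x"),
    ]
    print(curriculum_stages(data, num_stages=2))
```

Under these assumptions, fine-tuning would start on the first stage, where the baseline already agrees closely with the reference, and add the later, noisier stages progressively.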