We improve low-resource ASR by combining multilingual training with self-supervised learning. Concretely, we use an International Phonetic Alphabet (IPA) multilingual model to create frame-level pseudo labels for unlabeled speech, and use these pseudo labels to guide hidden-unit BERT (HuBERT) based speech pretraining in a phonetically informed manner. Experiments on the Multilingual Speech (MLS) Corpus show that the proposed approach consistently outperforms the standard HuBERT on all target languages. Moreover, on 3 of the 4 languages, it outperforms the standard HuBERT while using up to 1.5k hours (75%) less supervised training data. Our approach also outperforms most state-of-the-art methods while using far less pretraining data, in both hours and language diversity. Compared with XLSR-53 and a retraining-based multilingual method, it performs better in both the full and the limited fine-tuning data scenarios.