Student-teacher learning, or knowledge distillation (KD), has previously been used to address the data scarcity issue in training automatic speech recognition (ASR) systems. However, a limitation of KD training is that the student model classes must be a proper or improper subset of the teacher model classes. This prevents distillation even between acoustically similar languages if their character sets are not the same. In this work, this limitation is addressed by proposing MUltilingual Student-Teacher (MUST) learning, which exploits a posteriors mapping approach. A pre-trained mapping model is used to map posteriors from a teacher language ASR to the student language. These mapped posteriors are used as soft labels for KD learning. Various teacher ensemble schemes are explored for training an ASR model for low-resource languages. A model trained with MUST learning reduces the character error rate (CER) by up to 9.5% relative to a baseline monolingual ASR.
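To make the posteriors mapping and soft-label steps concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes frame-level posteriors, stands in for the pre-trained mapping model with a single hypothetical linear layer (`map_posteriors` and its weight matrix `W`), uses a weighted average as one possible teacher ensemble scheme (`ensemble_soft_labels`), and uses a frame-wise cross-entropy against the soft labels as the KD loss (`kd_loss`). All function names, shapes, and the choice of loss are illustrative assumptions.

```python
import numpy as np


def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def softmax(logits, axis=-1):
    return np.exp(log_softmax(logits, axis=axis))


def map_posteriors(teacher_post, W, eps=1e-8):
    """Hypothetical mapping model: one linear layer on log-posteriors.

    teacher_post: (frames, n_teacher_classes), rows sum to 1.
    W:            (n_teacher_classes, n_student_classes), assumed pre-trained.
    Returns posteriors over the student character set, (frames, n_student_classes).
    """
    return softmax(np.log(teacher_post + eps) @ W)


def ensemble_soft_labels(mapped_list, weights=None):
    """Weighted average of mapped posteriors (one possible ensemble scheme)."""
    stacked = np.stack(mapped_list)  # (n_teachers, frames, n_student_classes)
    if weights is None:
        weights = np.full(len(mapped_list), 1.0 / len(mapped_list))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # keep the result a valid distribution
    return np.tensordot(weights, stacked, axes=1)


def kd_loss(student_logits, soft_labels):
    """Frame-averaged cross-entropy between soft labels and student outputs."""
    return float(-(soft_labels * log_softmax(student_logits)).sum(axis=-1).mean())


# Usage with random stand-in data: two teachers with different character sets.
rng = np.random.default_rng(0)
frames, n_student = 6, 30
teacher_a = softmax(rng.normal(size=(frames, 45)))  # teacher A: 45 classes
teacher_b = softmax(rng.normal(size=(frames, 52)))  # teacher B: 52 classes
W_a = rng.normal(size=(45, n_student))              # stand-in for trained weights
W_b = rng.normal(size=(52, n_student))

soft = ensemble_soft_labels(
    [map_posteriors(teacher_a, W_a), map_posteriors(teacher_b, W_b)]
)
loss = kd_loss(rng.normal(size=(frames, n_student)), soft)
```

Because each teacher's posteriors are mapped into the student's class space before averaging, teachers with different character sets can be combined, which is the situation the subset restriction of standard KD rules out.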