Knowledge distillation has been widely used for model compression and domain adaptation in speech applications. When multiple teachers are available, knowledge can be transferred to the student simply by averaging the teachers' outputs. However, previous research shows that the student does not adapt well to such a combination. This paper proposes an elitist sampling strategy applied at the output of an ensemble of teacher models: for each utterance, it selects the best-decoded hypothesis generated by completely out-of-domain teachers, so that the student can generalize to an unseen domain. The teacher models are trained on AMI, LibriSpeech and WSJ, while the student is adapted to the Switchboard data. The results show that, with a selection strategy based on the individual models' posteriors, the student achieves a better WER than all the teachers and baselines, with a minimum absolute improvement of about 8.4 percent. Furthermore, insight into model adaptation with out-of-domain data is provided through a correlation analysis.
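The selection step described above can be illustrated with a minimal sketch. It assumes each teacher returns one decoded hypothesis per utterance together with a scalar confidence derived from its posteriors; the `Hypothesis` class, the `log_posterior` field (taken here to be a length-normalised log posterior), and the function names are hypothetical and introduced only for illustration. The abstract does not specify the exact scoring, so this should not be read as the paper's implementation.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    teacher: str          # name of the teacher model (illustrative label)
    text: str             # decoded transcript for one utterance
    log_posterior: float  # ASSUMPTION: length-normalised log posterior score


def select_elite(hypotheses):
    """Return the hypothesis with the highest posterior-based score."""
    if not hypotheses:
        raise ValueError("at least one teacher hypothesis is required")
    return max(hypotheses, key=lambda h: h.log_posterior)


def build_pseudo_labels(decoded):
    """Map each utterance id to the transcript of its best-scoring teacher.

    `decoded` maps an utterance id to the list of hypotheses produced by
    the teachers for that utterance.
    """
    return {utt_id: select_elite(hyps).text for utt_id, hyps in decoded.items()}


# Made-up example data: three out-of-domain teachers decode two utterances.
decoded = {
    "utt1": [
        Hypothesis("ami", "hello there", -0.9),
        Hypothesis("librispeech", "hello their", -1.4),
        Hypothesis("wsj", "hollow there", -2.1),
    ],
    "utt2": [
        Hypothesis("ami", "okay write", -1.8),
        Hypothesis("librispeech", "ok right", -2.0),
        Hypothesis("wsj", "okay right", -0.7),
    ],
}

pseudo_labels = build_pseudo_labels(decoded)
```

In this sketch the student would then be trained on `pseudo_labels` as targets. Unlike output averaging, each target comes from a single teacher, so a confident teacher is not diluted by the less reliable ones.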