Distilled self-supervised models have shown competitive performance and efficiency in recent years. However, little is known about jointly distilling multiple self-supervised speech models. In this work, we perform Ensemble Knowledge Distillation (EKD) on several self-supervised speech models, including HuBERT, RobustHuBERT, and WavLM. We apply two aggregation techniques, layerwise-average and layerwise-concatenation, to the representations of the different teacher models and find that the former is more effective. In addition, we propose a multiple prediction head method in which the student model predicts different layer outputs of multiple teacher models simultaneously. Experimental results show that our method improves the performance of the distilled models on four downstream speech processing tasks in the hidden-set track of the SUPERB benchmark: Phoneme Recognition, Speaker Identification, Emotion Recognition, and Automatic Speech Recognition.
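To make the two aggregation techniques and the multiple prediction head idea concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes each teacher's output is a nested list shaped [layer][frame][feature] and that all teachers share the same number of layers, frames, and features. The function names, the toy values, and the mean absolute error used as a distillation loss are all hypothetical choices made for illustration; the abstract does not specify the loss.

```python
# Hypothetical sketch: every name and number below is illustrative only.
# A teacher's output is a nested list shaped [layer][frame][feature].

def layerwise_average(teachers):
    """Average the same layer across teachers, element by element."""
    n = len(teachers)
    return [
        [[sum(vals) / n for vals in zip(*frames)] for frames in zip(*layers)]
        for layers in zip(*teachers)
    ]

def layerwise_concat(teachers):
    """Concatenate the same layer across teachers along the feature axis."""
    return [
        [[v for frame in frames for v in frame] for frames in zip(*layers)]
        for layers in zip(*teachers)
    ]

def multi_head_targets(teachers, layer_ids):
    """Keep one separate target per (teacher, layer) pair, one per head."""
    return {
        (t, l): teachers[t][l]
        for t in range(len(teachers))
        for l in layer_ids
    }

def mean_abs_error(pred, target):
    """Mean absolute difference over all frames and features (assumed loss)."""
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(pred, target):
        for p, t in zip(pred_frame, target_frame):
            total += abs(p - t)
            count += 1
    return total / count

def multi_head_loss(predictions, targets):
    """Sum the per-head losses; each head is matched to its own target."""
    return sum(mean_abs_error(predictions[k], targets[k]) for k in targets)

# Toy teachers: 2 layers, 1 frame, 2 features each (not real model outputs).
teacher_a = [[[1.0, 2.0]], [[3.0, 4.0]]]
teacher_b = [[[3.0, 4.0]], [[5.0, 8.0]]]

averaged = layerwise_average([teacher_a, teacher_b])
concatenated = layerwise_concat([teacher_a, teacher_b])
targets = multi_head_targets([teacher_a, teacher_b], layer_ids=[1])
```

In this sketch, averaging keeps the feature dimension unchanged, concatenation multiplies it by the number of teachers, and the multi-head variant leaves the teacher representations unaggregated so that each student head has its own target.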