Compared to large speech foundation models, small distilled models exhibit degraded noise robustness. The student's robustness can be improved by introducing noise at the inputs during pre-training. Even so, using the standard distillation loss still yields a student with degraded performance. Thus, this paper proposes improving student robustness via distillation with correlation metrics. Teacher behavior is learned by driving the cross-correlation matrix between teacher and student representations towards the identity matrix. Noise robustness is encouraged by minimizing the student's self-correlation. The proposed method is agnostic to the teacher model and consistently outperforms the previous approach. This work also proposes a heuristic to weigh the importance of the two correlation terms automatically. Experiments show consistently better clean and noise generalization on Intent Classification, Keyword Spotting, and Automatic Speech Recognition tasks on the SUPERB Challenge.
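The two correlation terms described in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names, the fixed weights `alpha` and `beta` (standing in for the automatic weighting heuristic, which the abstract does not specify), and the reading of "self-correlation minimization" as a penalty on off-diagonal entries are all assumptions made for illustration. The sketch also assumes teacher and student features have already been projected to the same dimension.

```python
import numpy as np


def standardize(x, eps=1e-8):
    """Zero-mean, unit-variance normalization of each feature over the batch."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)


def correlation_distillation_loss(teacher, student, alpha=1.0, beta=1.0):
    """Hypothetical correlation-based distillation objective.

    teacher, student: arrays of shape (n_frames, dim) with equal dim
    (assumption: the student output is projected to the teacher's size).
    alpha, beta: fixed illustrative weights, not the paper's heuristic.
    """
    n, d = student.shape
    t = standardize(teacher)
    s = standardize(student)

    cross = t.T @ s / n       # (dim, dim) teacher-student cross-correlation
    self_corr = s.T @ s / n   # (dim, dim) student self-correlation

    # Cross term: pull the cross-correlation matrix toward the identity.
    cross_loss = np.sum((cross - np.eye(d)) ** 2)

    # Self term (assumed form): penalize off-diagonal self-correlation,
    # i.e. redundancy between different student feature dimensions.
    off_diag = self_corr - np.diag(np.diag(self_corr))
    self_loss = np.sum(off_diag ** 2)

    return alpha * cross_loss + beta * self_loss, cross_loss, self_loss
```

In this sketch the loss is near zero when the student reproduces decorrelated teacher features exactly, and grows when the student's features disagree with the teacher's or become mutually redundant. In a training loop the same computation would be written in an autodiff framework and applied to noise-augmented student inputs, as the abstract describes.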