Continued self-supervised (SSL) pre-training, which adapts existing SSL models to a target domain, has been shown to be extremely effective for low-resource Automatic Speech Recognition (ASR). This paper proposes Stable Distillation, a simple and novel approach for SSL-based continued pre-training that boosts ASR performance in target domains where both labeled and unlabeled data are limited. Stable Distillation employs self-distillation as regularization for continued pre-training, alleviating over-fitting, a common problem in continued pre-training when the source and target domains differ. Specifically, we first perform vanilla continued pre-training of an initial SSL pre-trained model on the target-domain ASR dataset and call the resulting model the teacher. Next, we take the same initial pre-trained model as the student and perform continued pre-training while constraining its hidden representations to be close to those of the teacher (via an MSE loss). This student is then used for downstream ASR fine-tuning on the target dataset. In practice, Stable Distillation outperforms all our baselines by 0.8 to 7 WER when evaluated in various experimental settings.
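The student objective described above can be sketched as an SSL loss plus an MSE penalty that pulls the student's hidden representations toward the teacher's. The following is a minimal illustration in plain Python, not the authors' implementation: the function names `mse` and `student_loss`, the additive combination, and the `distill_weight` parameter are all assumptions, since the abstract states only that an MSE loss is used. A real system would compute this on tensors inside a training loop, with the teacher presumably kept fixed.

```python
from typing import Sequence

Vector = Sequence[float]


def mse(student_hidden: Sequence[Vector], teacher_hidden: Sequence[Vector]) -> float:
    """Mean squared error over all frames and feature dimensions."""
    if len(student_hidden) != len(teacher_hidden):
        raise ValueError("student and teacher must have the same number of frames")
    total, count = 0.0, 0
    for s_frame, t_frame in zip(student_hidden, teacher_hidden):
        if len(s_frame) != len(t_frame):
            raise ValueError("frame dimensions must match")
        for s, t in zip(s_frame, t_frame):
            total += (s - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("hidden representations must not be empty")
    return total / count


def student_loss(
    ssl_loss: float,
    student_hidden: Sequence[Vector],
    teacher_hidden: Sequence[Vector],
    distill_weight: float = 1.0,  # hypothetical weighting, not given in the abstract
) -> float:
    """SSL loss regularized by the distance to the teacher's hidden states.

    Assumption: the two terms are simply added; the abstract does not
    specify how they are combined.
    """
    return ssl_loss + distill_weight * mse(student_hidden, teacher_hidden)


# Toy example: two frames with two features each.
teacher = [[1.0, 2.0], [3.0, 4.0]]
student = [[1.0, 2.0], [3.0, 2.0]]
print(student_loss(0.5, student, teacher, distill_weight=2.0))
```

With these toy values only one feature differs (by 2), so the MSE is 4 / 4 = 1.0 and the combined loss is 0.5 + 2.0 × 1.0 = 2.5. A larger `distill_weight` keeps the student closer to the teacher, which is the regularizing effect the method relies on.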