Continued pre-training (CP) offers multiple advantages, such as target-domain adaptation and the ability to exploit the continuous stream of unlabeled data available online. However, continued pre-training on out-of-domain distributions often causes catastrophic forgetting of previously acquired knowledge, which results in sub-optimal automatic speech recognition (ASR) performance. This paper presents FusDom, a simple and novel methodology for continued pre-training based on self-supervised learning (SSL). FusDom learns speech representations that are robust and adaptive, yet do not forget concepts seen in the past. Instead of solving the SSL pre-text task on the output representations of a single model, FusDom leverages two identical pre-trained SSL models, a teacher and a student, with a modified pre-training head to solve the CP SSL pre-text task. This head employs a cross-attention mechanism between the representations of both models, and only the student receives gradient updates; the teacher stays frozen. Finally, the student is fine-tuned for ASR. In practice, FusDom significantly outperforms all our baselines across settings, with WER improvements ranging from 0.2 to 7.3 in the target domain while retaining performance in the earlier domain.
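The teacher-student setup described in the abstract can be illustrated with a minimal PyTorch sketch. It is not the authors' implementation: `TinyEncoder` is a hypothetical stand-in for a pre-trained SSL speech encoder, the choice of student representations as queries and teacher representations as keys and values is an assumption (the abstract only says cross-attention is applied between the two), the residual connection is an illustrative choice, and the cross-entropy over random pseudo-labels is a placeholder for the actual SSL pre-text loss.

```python
import copy

import torch
import torch.nn as nn


class TinyEncoder(nn.Module):
    """Hypothetical stand-in for a pre-trained SSL speech encoder."""

    def __init__(self, feat_dim=40, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )

    def forward(self, x):  # x: (batch, frames, feat_dim)
        return self.net(x)


class CrossAttentionHead(nn.Module):
    """Assumed head: student frames attend over teacher frames."""

    def __init__(self, dim=64, num_heads=4, num_targets=32):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, num_targets)

    def forward(self, student_repr, teacher_repr):
        # Assumption: queries come from the student, keys/values from the teacher.
        fused, _ = self.attn(
            query=student_repr, key=teacher_repr, value=teacher_repr
        )
        # Residual connection is an illustrative choice, not taken from the paper.
        return self.proj(fused + student_repr)


torch.manual_seed(0)

# Two identical copies of the same "pre-trained" model.
student = TinyEncoder()
teacher = copy.deepcopy(student)

# The teacher never receives gradient updates.
for p in teacher.parameters():
    p.requires_grad_(False)
teacher.eval()

head = CrossAttentionHead()
optimizer = torch.optim.AdamW(
    list(student.parameters()) + list(head.parameters()), lr=1e-4
)

# Placeholder batch: 2 utterances, 50 frames, 40-dim features.
features = torch.randn(2, 50, 40)
# Placeholder pseudo-labels standing in for the real SSL pre-text targets.
targets = torch.randint(0, 32, (2, 50))

teacher_before = [p.clone() for p in teacher.parameters()]

with torch.no_grad():
    teacher_repr = teacher(features)
student_repr = student(features)

logits = head(student_repr, teacher_repr)  # (batch, frames, num_targets)
loss = nn.functional.cross_entropy(logits.reshape(-1, 32), targets.reshape(-1))

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

After this step only the student and the head have changed; the teacher's parameters are untouched, and the student alone would then be carried forward to ASR fine-tuning.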