Continual Learning (CL) involves fine-tuning pre-trained models on new data while maintaining performance on the original training data. This is particularly relevant for expanding multilingual ASR (MASR) capabilities. However, existing CL methods, mainly designed for computer vision and reinforcement learning tasks, often yield sub-optimal results when applied directly to MASR. We hypothesise that this is because continual learning of the auto-regressive decoder in the MASR model is difficult. To verify this, we propose four optimizations targeting the decoder: decoder-layer gradient surgery, freezing unused token embeddings, suppressing the output of newly added tokens, and learning-rate re-scaling. Our experiments on adapting Whisper to 10 unseen languages from the Common Voice dataset demonstrate that these optimizations reduce the Average Word Error Rate (AWER) on pre-trained languages from 14.2% to 12.4% compared with Experience Replay, without compromising the AWER on the new languages.
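Two of the listed optimizations lend themselves to a compact illustration: freezing unused token embeddings can be realized by zeroing the corresponding gradient rows of the embedding matrix, and suppressing newly added tokens can be realized by masking their logits before the softmax when decoding a pre-trained language. The sketch below is a minimal PyTorch illustration of both ideas, assuming a toy vocabulary and illustrative token-id ranges; it is not the paper's actual implementation.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 8 "old" (pre-trained) token ids plus 4 newly added ones.
OLD_VOCAB, NEW_VOCAB = 8, 4
VOCAB = OLD_VOCAB + NEW_VOCAB

embedding = nn.Embedding(VOCAB, 16)

# Freeze embeddings of tokens unused by the new languages (here ids 0-3,
# an illustrative choice) by zeroing their gradient rows after backward.
frozen_ids = torch.tensor([0, 1, 2, 3])

def zero_frozen_rows(grad):
    grad = grad.clone()
    grad[frozen_ids] = 0.0
    return grad

embedding.weight.register_hook(zero_frozen_rows)

# Suppress newly added tokens when decoding a pre-trained language by
# setting their logits to -inf before the softmax.
def suppress_new_tokens(logits):
    logits = logits.clone()
    logits[..., OLD_VOCAB:] = float("-inf")
    return logits

# Demo: one forward/backward pass touches a frozen row (1) and a free row (5).
ids = torch.tensor([1, 5, 9])
embedding(ids).sum().backward()

logits = torch.randn(1, VOCAB)
masked = suppress_new_tokens(logits)
```

With the hook in place, rows listed in `frozen_ids` receive zero gradient even when their tokens appear in a batch, while all other rows update normally; the logit mask guarantees the decoder can never emit a newly added token for a pre-trained language.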