In this paper, we propose a language-universal adapter learning framework based on a pre-trained model for end-to-end multilingual automatic speech recognition (ASR). For acoustic modeling, the wav2vec 2.0 pre-trained model is fine-tuned with language-specific and language-universal adapters inserted into it. Online knowledge distillation is then used to enable the language-universal adapters to learn both language-specific and universal features. Confusion of linguistic information is further reduced by leveraging language identifiers (LIDs), which are used to perform a position-wise modification of the multi-head attention outputs. During inference, the language-specific adapters are removed and only the language-universal adapters remain active. The proposed method improves recognition accuracy and avoids the linear growth of adapter parameters with the number of languages seen in common multilingual ASR systems. Experiments on the BABEL dataset confirm the effectiveness of the proposed framework: compared to the conventional multilingual model, it achieves a 3.3% absolute error rate reduction. The code is available at: https://github.com/shen9712/UniversalAdapterLearning.
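To make the parameter-growth claim concrete, the following minimal sketch counts the adapter parameters that must be kept at inference time under the two designs. It is an illustration only, not the released implementation: the function names are hypothetical, each adapter is assumed to be a plain bottleneck (down-projection and up-projection with biases, no layer normalization), one adapter is assumed per Transformer layer, and the bottleneck width of 256 is an arbitrary choice. The hidden size of 768 and the 12 layers correspond to the wav2vec 2.0 base configuration.

```python
def adapter_params(d_model: int, bottleneck: int) -> int:
    """Parameter count of one bottleneck adapter (assumed form:
    linear down-projection + linear up-projection, both with biases)."""
    down = d_model * bottleneck + bottleneck   # weights + bias
    up = bottleneck * d_model + d_model        # weights + bias
    return down + up


def inference_adapter_params(n_languages: int, n_layers: int,
                             d_model: int, bottleneck: int,
                             universal: bool) -> int:
    """Adapter parameters kept at inference time.

    universal=False: one adapter per language in every layer,
                     so the count grows linearly with n_languages.
    universal=True:  a single shared adapter per layer,
                     so the count does not depend on n_languages.
    """
    copies = 1 if universal else n_languages
    return copies * n_layers * adapter_params(d_model, bottleneck)


# Assumed sizes: wav2vec 2.0 base (768 hidden units, 12 layers);
# the bottleneck width is a hypothetical value chosen for illustration.
D_MODEL, BOTTLENECK, N_LAYERS = 768, 256, 12

if __name__ == "__main__":
    for n in (1, 5, 10):
        specific = inference_adapter_params(n, N_LAYERS, D_MODEL, BOTTLENECK, universal=False)
        shared = inference_adapter_params(n, N_LAYERS, D_MODEL, BOTTLENECK, universal=True)
        print(f"languages={n:2d}  language-specific={specific:>10,d}  universal={shared:>10,d}")
```

Under these assumptions the language-specific count scales with the number of languages, while the language-universal count stays fixed, which is the property the framework relies on when it discards the language-specific adapters after training.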