This paper addresses the challenge of integrating low-resource languages into multilingual automatic speech recognition (ASR) systems. We introduce a novel application of weighted cross-entropy, typically used for imbalanced datasets, to facilitate the integration of low-resource languages into pre-trained multilingual ASR models in a continual multilingual learning setting. We fine-tune the Whisper multilingual ASR model on five high-resource languages and one low-resource language, employing language-weighted dynamic cross-entropy and data augmentation. The results show a 6.69% word error rate (WER) reduction for the low-resource language compared to a fine-tuned baseline without our approach, and a 48.86% WER reduction compared to the original Whisper model. In addition, our approach yields an average WER reduction of 3.29% across the six languages, with no degradation for the high-resource languages.
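The abstract does not spell out how the per-language weights are chosen. A minimal sketch of the general idea, assuming inverse-frequency weights normalized so high-resource languages keep a weight near 1 (the language codes and utterance counts below are hypothetical, not from the paper):

```python
import math

# Hypothetical per-language utterance counts (illustrative only).
counts = {"en": 10000, "de": 9000, "fr": 8000, "es": 8500, "it": 7000, "xx": 300}

# One simple weighting scheme: inverse frequency, rescaled so the
# average weight is 1, which leaves high-resource losses roughly unchanged
# while upweighting the low-resource language "xx".
total = sum(counts.values())
raw = {lang: total / n for lang, n in counts.items()}
mean_raw = sum(raw.values()) / len(raw)
weights = {lang: w / mean_raw for lang, w in raw.items()}

def weighted_ce(log_probs, target, lang):
    """Cross-entropy for one token, scaled by its language weight."""
    return -weights[lang] * log_probs[target]

# The same token-level loss counts far more for the low-resource language.
log_probs = {"a": math.log(0.7), "b": math.log(0.3)}
loss_high = weighted_ce(log_probs, "a", "en")
loss_low = weighted_ce(log_probs, "a", "xx")
```

In practice this weight would multiply the per-utterance loss inside the fine-tuning loop; the "dynamic" variant in the paper presumably adjusts the weights during training rather than fixing them up front.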