Efficient Adapter Finetuning for Tail Languages in Streaming Multilingual ASR

End-to-end ASR models are often preferred in the streaming multilingual scenario because they are easier to deploy and can benefit from pre-trained speech models such as powerful foundation models. However, the heterogeneity of languages and the imbalance in their available data can degrade performance and cause different languages, especially tail ones, to reach peak performance at different points during training. In some cases the data itself may become unavailable because of stricter privacy protection. Existing work tends to significantly increase the model size or to learn language-specific decoders that accommodate each language separately. In this study, we explore a simple yet effective Language-Dependent Adapter (LDA) finetuning approach under a cascaded Conformer transducer framework, enhanced by teacher pseudo-labeling, for tail languages in streaming multilingual ASR. Each language's adapter accounts for only 0.4% of the full model. It is plugged into the frozen foundation model and is the only trainable module during finetuning with noisy student training. The final model merges the adapter parameters from different checkpoints for different languages. Model performance is validated on a challenging multilingual dictation dataset that includes 39 tail languages across Latin, Greek, Arabic, etc. The proposed method reduces word error rate by 12.2% on average and by up to 37.5% on a single locale. Furthermore, we show that the parameter-efficient LDA can match the quality of full model finetuning, thus greatly alleviating the asynchronous peak performance issue.
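
To make the checkpoint-merging step concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a hypothetical checkpoint layout in which each training step stores one adapter per language together with a held-out WER per language, and it assumes the selection rule is "lowest held-out WER". The abstract states only that adapter parameters from different checkpoints are merged for different languages. The locale names, parameter name `down_proj`, and all numbers are illustrative.

```python
from typing import Dict, List

# Hypothetical checkpoint layout (an assumption, not taken from the paper):
#   checkpoints[step] = {
#       "dev_wer":  {language: WER on a held-out set},
#       "adapters": {language: {parameter name: list of floats}},
#   }
AdapterParams = Dict[str, List[float]]


def best_step_per_language(checkpoints: Dict[int, dict]) -> Dict[str, int]:
    """For each language, return the step with the lowest held-out WER."""
    best: Dict[str, int] = {}
    for step in sorted(checkpoints):
        for lang, wer in checkpoints[step]["dev_wer"].items():
            # Strict comparison: on a tie, the earlier step is kept.
            if lang not in best or wer < checkpoints[best[lang]]["dev_wer"][lang]:
                best[lang] = step
    return best


def merge_adapters(checkpoints: Dict[int, dict]) -> Dict[str, AdapterParams]:
    """Build one adapter set, taking each language from its own best step."""
    best = best_step_per_language(checkpoints)
    return {
        lang: {name: list(vals)
               for name, vals in checkpoints[step]["adapters"][lang].items()}
        for lang, step in best.items()
    }


# Illustrative data: the two languages peak at different steps.
checkpoints = {
    1000: {
        "dev_wer": {"el-GR": 11.0, "ar-EG": 19.5},
        "adapters": {
            "el-GR": {"down_proj": [0.1, 0.2]},
            "ar-EG": {"down_proj": [0.3, 0.4]},
        },
    },
    2000: {
        "dev_wer": {"el-GR": 11.8, "ar-EG": 18.1},
        "adapters": {
            "el-GR": {"down_proj": [0.5, 0.6]},
            "ar-EG": {"down_proj": [0.7, 0.8]},
        },
    },
}

best_steps = best_step_per_language(checkpoints)
merged = merge_adapters(checkpoints)
```

Because the foundation model is frozen and shared, only the small per-language adapters differ between checkpoints, so each language can keep the parameters from its own best step without affecting the others.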
