While large language models (LLMs) have made impressive progress in natural language processing, it remains unclear how to use them to improve automatic speech recognition (ASR). In this work, we propose training a single multilingual language model (LM) for shallow fusion in multiple languages. We push the limits of the multilingual LM to cover up to 84 languages by scaling up with a mixture-of-experts LLM, namely the generalist language model (GLaM). As the number of experts increases, GLaM dynamically selects only two of them at each decoding step, which keeps the inference computation roughly constant. We then apply GLaM to a multilingual shallow fusion task based on a state-of-the-art end-to-end model. Compared to a dense LM with similar inference computation, GLaM reduces the word error rate (WER) on an English long-tail test set by 4.4% relative. In a multilingual shallow fusion task, GLaM improves 41 out of 50 languages, with an average relative WER reduction of 3.85% and a maximum reduction of 10%. Compared to the baseline model, GLaM achieves an average WER reduction of 5.53% over 43 languages.
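As an illustration of shallow fusion, the sketch below rescores candidate hypotheses by log-linearly combining an ASR score with an external LM score. It is a minimal sketch under stated assumptions, not the system described in this work: the function names, the interpolation weight `lm_weight`, and the example scores are all hypothetical, and in practice the fusion is usually applied inside beam search rather than on complete hypotheses.

```python
def fused_score(asr_logprob, lm_logprob, lm_weight=0.3):
    """Log-linear interpolation of ASR and external LM log-probabilities.

    lm_weight is a hypothetical tuning parameter; it is not taken from the paper.
    """
    return asr_logprob + lm_weight * lm_logprob


def pick_best(hypotheses, lm_weight=0.3):
    """Return the text of the hypothesis with the highest fused score.

    hypotheses: list of (text, asr_logprob, lm_logprob) tuples.
    """
    best = max(hypotheses, key=lambda h: fused_score(h[1], h[2], lm_weight))
    return best[0]


# Made-up scores: the ASR model slightly prefers the second hypothesis,
# while the LM strongly prefers the first.
example_hypotheses = [
    ("play adele", -1.0, -2.0),
    ("play a dell", -0.9, -6.0),
]
```

With `lm_weight=0.0` the choice falls back to the ASR score alone, so comparing `pick_best(example_hypotheses, 0.0)` with `pick_best(example_hypotheses, 0.3)` shows how the external LM can change the selected hypothesis.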