Many factors have separately proven effective for improving multilingual ASR, including language identity (LID) and phoneme information, language-specific processing modules, and cross-lingual self-supervised speech representations. However, few studies have combined them synergistically into a unified solution, and how to do so remains an open question. To this end, a novel view is proposed that incorporates a hierarchical information path, LUPET, into multilingual ASR. LUPET is a path that encodes multiple kinds of information at different granularities, from shallow to deep encoder layers. Information that appears earlier in the path helps derive information that appears later. Specifically, the input goes from LID prediction to acoustic unit discovery, followed by phoneme sharing, and is then dynamically routed by a mixture-of-experts for final token recognition. Experiments on 10 languages of Common Voice demonstrate the superior performance of LUPET. Importantly, LUPET significantly boosts recognition on high-resource languages, thus mitigating the phenomenon in which high-resource performance is compromised in favor of low-resource languages in a multilingual setting.
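The final step of the path, dynamic routing by a mixture-of-experts, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the gating function, the number of experts, or the routing granularity, so the softmax gate, the top-k selection, and the names `route_frame`, `gate_weights`, and `experts` are all assumptions made for illustration. The stage list only restates the order given in the abstract.

```python
import math

# Stage order as stated in the abstract, from shallow to deep encoder layers.
LUPET_STAGES = [
    "lid_prediction",
    "acoustic_unit_discovery",
    "phoneme_sharing",
    "moe_routing",
    "token_recognition",
]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_frame(frame, gate_weights, experts, top_k=1):
    """Route one feature frame through its top-k experts (hypothetical helper).

    frame        -- list of floats, one encoder frame
    gate_weights -- one weight row per expert (assumed linear gate, no bias)
    experts      -- callables mapping a frame to a frame of the same length
    Returns the gate-weighted mix of expert outputs and the chosen expert ids.
    """
    logits = [sum(w * x for w, x in zip(row, frame)) for row in gate_weights]
    probs = softmax(logits)
    # Keep the k experts with the highest gate probability.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)  # renormalize over the chosen experts
    mixed = [0.0] * len(frame)
    for i in chosen:
        expert_out = experts[i](frame)
        for d in range(len(mixed)):
            mixed[d] += (probs[i] / norm) * expert_out[d]
    return mixed, chosen


# Toy usage: two stand-in "experts" and a gate that prefers expert 0 for this frame.
toy_experts = [lambda f: [2.0 * x for x in f], lambda f: [-x for x in f]]
toy_gate = [[1.0, 0.0], [-1.0, 0.0]]
output, picked = route_frame([1.0, 0.0], toy_gate, toy_experts, top_k=1)
```

With `top_k=1` each frame is handled by a single expert, which keeps compute per frame constant as experts are added. A larger `top_k` blends several experts at proportionally higher cost.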