The prevalence of powerful multilingual models such as Whisper has significantly advanced research on speech recognition. However, these models often struggle in the code-switching setting, which is essential to multilingual speech recognition. Recent studies have attempted to address this setting by separating the modules for different languages so that each language has a distinct latent representation. Other methods build a switching mechanism on top of language identification. In this study, a new attention-guided adaptation is proposed to conduct parameter-efficient learning for bilingual ASR. The method selects the attention heads in a model that most closely express language identity and then guides those heads to attend correctly to their corresponding languages. Experiments on a Mandarin-English code-switching speech corpus show that the proposed approach achieves a 14.2% mixed error rate, surpassing the state-of-the-art method while training only 5.6% additional parameters relative to Whisper.
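The abstract reports its result as a mixed error rate but does not define the metric. The sketch below shows one common convention for Mandarin-English code-switching evaluation, assumed here rather than taken from the paper: each Mandarin character and each English word counts as one token, and the rate is the token-level edit distance divided by the number of reference tokens. The names `tokenize_mixed`, `edit_distance`, and `mixed_error_rate` are hypothetical, and the paper's actual text normalization may differ.

```python
import re

# Assumption: one token per CJK character, one token per run of Latin
# letters, digits, or apostrophes. Punctuation and whitespace are dropped.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9']+")


def tokenize_mixed(text):
    """Split mixed Mandarin-English text into characters and lowercased words."""
    return [token.lower() for token in _TOKEN.findall(text)]


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_token in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_token in enumerate(hyp, start=1):
            cost = 0 if ref_token == hyp_token else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def mixed_error_rate(references, hypotheses):
    """Corpus-level rate: total edit errors over total reference tokens."""
    errors = 0
    total = 0
    for ref_text, hyp_text in zip(references, hypotheses):
        ref = tokenize_mixed(ref_text)
        hyp = tokenize_mixed(hyp_text)
        errors += edit_distance(ref, hyp)
        total += len(ref)
    if total == 0:
        raise ValueError("references contain no tokens")
    return errors / total


if __name__ == "__main__":
    # Made-up example sentences, not drawn from any corpus.
    refs = ["我想去 shopping mall 买东西"]
    hyps = ["我想去 shopping more 买东西"]
    print(mixed_error_rate(refs, hyps))
```

In the example, the reference has eight tokens (six characters and two words) and the hypothesis substitutes one English word, so a single error is counted against eight reference tokens.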