The prevalence of powerful multilingual models, such as Whisper, has significantly advanced research on speech recognition. However, these models often struggle with code-switching, which is essential in multilingual speech recognition. Recent studies have attempted to address this setting by separating the modules for different languages to ensure distinct latent representations for each language. Other methods apply a switching mechanism based on language identification. In this study, a new attention-guided adaptation is proposed to conduct parameter-efficient learning for bilingual ASR. This method selects the attention heads in a model that most closely express language identities and then guides those heads to attend correctly to their corresponding languages. Experiments on a Mandarin-English code-switching speech corpus show that the proposed approach achieves a 14.2% mixed error rate, surpassing the state-of-the-art method, while training only 5.6% additional parameters over Whisper.
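The abstract reports a mixed error rate (MER) without defining it. As a minimal sketch, assuming the convention commonly used for Mandarin-English code-switching evaluation (Mandarin scored per character, English scored per word, edit distance normalized by reference length), the metric could be computed as follows. The function names `tokenize_mixed`, `edit_distance`, and `mixed_error_rate` are hypothetical, and the text normalization here (lowercasing only, no punctuation handling) is an assumption rather than the authors' scoring pipeline.

```python
import re

# Assumption: one token per CJK ideograph (basic block U+4E00..U+9FFF),
# and one token per run of other non-whitespace characters (English words).
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[^\u4e00-\u9fff\s]+")


def tokenize_mixed(text):
    """Split a code-switched transcript into Mandarin characters and English words."""
    return _TOKEN.findall(text.lower())


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (substitution, insertion, deletion cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference token
                curr[j - 1] + 1,         # insertion of a hypothesis token
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = curr
    return prev[-1]


def mixed_error_rate(references, hypotheses):
    """Total edit errors over total reference tokens, pooled across utterances."""
    errors = 0
    total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens = tokenize_mixed(ref)
        errors += edit_distance(ref_tokens, tokenize_mixed(hyp))
        total += len(ref_tokens)
    if total == 0:
        raise ValueError("references contain no tokens")
    return errors / total
```

For example, scoring the hypothesis `我想 order 一杯 咖啡` against the reference `我想 order 一杯 coffee` gives six reference tokens and two errors (one substitution, one insertion), so the sketch returns 2/6. Pooling errors over the whole test set, rather than averaging per-utterance rates, keeps short utterances from dominating the score.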