Code-switching automatic speech recognition (ASR) aims to accurately transcribe speech containing two or more languages. To better capture language-specific speech representations and address language confusion in code-switching ASR, the mixture-of-experts (MoE) architecture and an additional language diarization (LD) decoder are commonly employed. However, most studies still rely on simple operations such as weighted summation or concatenation to fuse language-specific speech representations, leaving considerable room to explore richer integration of language bias information. In this paper, we introduce CAMEL, a cross-attention-based MoE and language bias approach for code-switching ASR. Specifically, after each MoE layer, we fuse language-specific speech representations with cross-attention, leveraging its strong contextual modeling ability. Additionally, we design a source-attention-based mechanism to incorporate language information from the LD decoder output into the text embeddings. Experimental results demonstrate that our approach achieves state-of-the-art performance on the SEAME, ASRU200, and ASRU700+LibriSpeech460 Mandarin-English code-switching ASR datasets.
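The cross-attention fusion of language-specific representations described above can be sketched as follows. This is a minimal PyTorch illustration of the general idea, not the paper's implementation; the class name, head count, dimensions, and the symmetric two-stream query/key arrangement are all assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative sketch: fuse two language-specific speech
    representations (e.g., Mandarin and English expert outputs from an
    MoE layer) with cross-attention instead of weighted summation."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # Each language stream attends to the other language's stream.
        self.attn_en = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_zh = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, h_en: torch.Tensor, h_zh: torch.Tensor) -> torch.Tensor:
        # h_en, h_zh: (batch, time, d_model) language-specific expert outputs.
        # English stream queries the Mandarin stream, and vice versa,
        # so each frame can borrow context from the other language.
        fused_en, _ = self.attn_en(h_en, h_zh, h_zh)
        fused_zh, _ = self.attn_zh(h_zh, h_en, h_en)
        # Residual combination of both streams, then layer norm.
        return self.norm(h_en + fused_en + h_zh + fused_zh)

# Example: batch of 2 utterances, 50 frames, 256-dim features.
x_en = torch.randn(2, 50, 256)
x_zh = torch.randn(2, 50, 256)
out = CrossAttentionFusion()(x_en, x_zh)
print(out.shape)
```

The contrast with the baseline criticized in the abstract is that a weighted sum `a * h_en + (1 - a) * h_zh` mixes the two streams frame by frame with a scalar gate, whereas cross-attention lets every output frame draw on the full temporal context of the other language's representation.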