In this work, we propose SC-MoE, a Switch-Conformer-based mixture-of-experts (MoE) system for unified streaming and non-streaming code-switching (CS) automatic speech recognition (ASR). In the encoder of SC-MoE, we design a streaming MoE layer consisting of three language experts, corresponding to Mandarin, English, and blank, and equip it with a language identification (LID) network trained with a Connectionist Temporal Classification (CTC) loss that serves as the router, yielding a real-time streaming CS ASR system. To further exploit the language information embedded in text, we also incorporate MoE layers into the decoder of SC-MoE. In addition, we introduce routers into every MoE layer of the encoder and the decoder, which further improves recognition performance. Experimental results show that SC-MoE significantly improves CS ASR performance over the baseline with comparable computational efficiency.
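The routing idea in the encoder can be illustrated with a minimal sketch. It assumes hard top-1 routing per frame (in the style of Switch layers), where each frame's LID posterior over Mandarin, English, and blank selects a single expert. The abstract does not specify these details, so the function names (`route_frame`, `moe_layer`, `make_scale_expert`), the toy scaling experts, and the top-1 choice are all illustrative assumptions rather than the SC-MoE implementation.

```python
import math
from typing import Callable, List, Sequence

# Assumed expert order; the abstract names these three language experts.
LANGS = ("mandarin", "english", "blank")

Frame = List[float]
Expert = Callable[[Frame], Frame]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one frame's LID logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_frame(lid_logits: Sequence[float]) -> int:
    """Return the index of the most probable language (hypothetical top-1 routing)."""
    probs = softmax(lid_logits)
    return max(range(len(probs)), key=probs.__getitem__)


def make_scale_expert(scale: float) -> Expert:
    """Toy stand-in for an expert feed-forward network: scales every feature."""
    def expert(frame: Frame) -> Frame:
        return [scale * x for x in frame]
    return expert


def moe_layer(
    frames: Sequence[Frame],
    lid_logits: Sequence[Sequence[float]],
    experts: Sequence[Expert],
) -> List[Frame]:
    """Process frames one at a time, so routing needs no future context."""
    outputs = []
    for frame, logits in zip(frames, lid_logits):
        expert_index = route_frame(logits)
        outputs.append(experts[expert_index](frame))
    return outputs
```

Because each frame is routed using only its own LID logits, the same loop works whether frames arrive incrementally (streaming) or all at once (non-streaming). A real system would replace the toy experts with learned feed-forward networks and obtain the logits from the CTC-trained LID network.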