To mitigate confusion between languages in code-switching (CS) automatic speech recognition (ASR), recent conditionally factorized models, such as the language-aware encoder (LAE), explicitly disregard the contextual information between different languages. However, this information may be helpful for ASR modeling. To address this issue, we propose the LAE-ST-MoE framework, which incorporates a speech translation (ST) task into LAE and uses ST to learn the contextual information between different languages. The framework also introduces a task-based mixture-of-experts (MoE) module that employs separate feed-forward networks for the ASR and ST tasks. Experimental results on the ASRU 2019 Mandarin-English CS challenge dataset show that, compared with the LAE-based CTC model, the LAE-ST-MoE model achieves a 9.26% reduction in mix error rate on the CS test set with the same decoding parameters. Moreover, the well-trained LAE-ST-MoE model can perform ST from CS speech to Mandarin or English text.
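The task-based routing described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the class name `TaskMoE`, the helper functions, the layer sizes, the ReLU activation, and the residual connection are all assumptions chosen for illustration. The only idea taken from the abstract is that the ASR and ST tasks each get their own feed-forward network, selected by the task label rather than by a learned gate.

```python
import random


def make_ffn(d_model, d_hidden, rng):
    """Randomly initialise one feed-forward expert (hypothetical sizes and init)."""
    w1 = [[rng.gauss(0.0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    w2 = [[rng.gauss(0.0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]
    return w1, w2


def ffn_forward(ffn, x):
    """Apply Linear -> ReLU -> Linear to a single frame vector x."""
    w1, w2 = ffn
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


class TaskMoE:
    """One feed-forward expert per task; the task label picks the expert."""

    def __init__(self, d_model, d_hidden, tasks=("asr", "st"), seed=0):
        rng = random.Random(seed)
        self.experts = {task: make_ffn(d_model, d_hidden, rng) for task in tasks}

    def forward(self, frames, task):
        expert = self.experts[task]  # hard routing by task, no learned gate
        # Residual connection around the expert (an assumption of this sketch).
        return [
            [xi + yi for xi, yi in zip(x, ffn_forward(expert, x))]
            for x in frames
        ]


# Two toy 4-dimensional frames standing in for shared encoder outputs.
frames = [[0.1, -0.2, 0.3, 0.4], [0.5, 0.0, -0.1, 0.2]]
moe = TaskMoE(d_model=4, d_hidden=8)
asr_out = moe.forward(frames, "asr")  # passes through the ASR expert
st_out = moe.forward(frames, "st")    # same input, ST expert
```

In this sketch the two tasks share the same input frames but are transformed by different expert weights, so everything upstream of the module could be shared while the feed-forward parameters stay task-specific.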