With the rapid development of end-to-end (E2E) neural networks, recent years have witnessed unprecedented breakthroughs in automatic speech recognition (ASR). However, code-switching remains a major obstacle for ASR, because the scarcity of labeled data and the differences between languages often degrade recognition performance. In this paper, we focus exclusively on improving the acoustic encoder of E2E ASR to address the challenges posed by code-switching. Our main contributions are threefold. First, we introduce a novel disentanglement loss that enables the lower layer of the encoder to capture inter-lingual acoustic information while mitigating linguistic confusion at the higher layer of the encoder. Second, through comprehensive experiments, we verify that our proposed method outperforms prior methods that use pretrained dual encoders, while having access only to the code-switching corpus and using half as many parameters. Third, the apparent differentiation of the encoders' output features corroborates the complementarity between the disentanglement loss and the mixture-of-experts (MoE) architecture.