Large-capacity end-to-end models have significantly improved multilingual automatic speech recognition (ASR), but their computational cost poses challenges for on-device applications. We propose a streaming, truly multilingual Conformer that incorporates mixture-of-experts (MoE) layers, which learn to activate only a subset of parameters during training and inference. Each MoE layer contains a softmax gate that selects the best two experts among many during forward propagation. The proposed MoE layer keeps inference efficient because the number of activated parameters stays fixed as the number of experts increases. We evaluate the proposed model on a set of 12 languages and achieve an average 11.9% relative improvement in word error rate (WER) over the baseline. Compared to an adapter model that uses ground-truth language information, our MoE model achieves similar WER and activates a similar number of parameters, but without any language information. We further show around 3% relative WER improvement with multilingual shallow fusion.
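The top-2 gating described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`top2_gate`, `moe_layer`, `make_expert`), the toy scalar experts, and the renormalization of the two selected gate probabilities are all assumptions made for illustration. The abstract only states that a softmax gate chooses the best two experts.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top2_gate(x, gate_weights):
    """Pick the two experts with the highest gate probability for input x.

    gate_weights holds one weight row per expert (hypothetical linear gate).
    Returns [(expert_index, weight), (expert_index, weight)].
    Assumption: the two selected probabilities are renormalized to sum to 1.
    """
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    probs = softmax(logits)
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = probs[top2[0]] + probs[top2[1]]
    return [(i, probs[i] / norm) for i in top2]


def moe_layer(x, gate_weights, experts):
    """Combine the outputs of the two selected experts; the rest are never run."""
    out = [0.0] * len(x)
    for idx, weight in top2_gate(x, gate_weights):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


def make_expert(scale):
    """Toy expert that scales its input (stand-in for a feed-forward block)."""
    return lambda x: [scale * xi for xi in x]


if __name__ == "__main__":
    experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
    gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
    frame = [2.0, 1.0]
    print(top2_gate(frame, gate_weights))
    print(moe_layer(frame, gate_weights, experts))
```

In this sketch, adding more entries to `experts` and `gate_weights` grows the total parameter count, but each input still runs exactly two experts, which mirrors the fixed activated-parameter budget the abstract describes.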