End-to-end automatic speech recognition (E2E-ASR) models fall into several network architectures, such as connectionist temporal classification (CTC), recurrent neural network transducer (RNN-T), attention-based encoder-decoder, and mask-predict models. Each architecture has its own advantages and disadvantages, so practitioners often switch between models depending on application requirements. Instead of building separate models, we propose a joint modeling scheme in which four decoders (CTC, RNN-T, attention, and mask-predict) share the same encoder; we refer to this as 4D modeling. The 4D model is trained with multitask learning, which regularizes the model and improves its robustness thanks to the complementary properties of the four decoders. To train the 4D model efficiently, we introduce a two-stage training strategy that stabilizes multitask learning. In addition, we propose three novel one-pass beam search algorithms that combine three decoders (CTC, RNN-T, and attention) to further improve performance. The three algorithms differ in which decoder serves as the primary decoder. We carefully evaluate the performance and computational tradeoffs of each algorithm. Experimental results demonstrate that the jointly trained 4D model outperforms E2E-ASR models trained with only a single decoder. Furthermore, we demonstrate that the proposed one-pass beam search outperforms the previously proposed CTC/attention decoding.
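
As a rough illustration of the multitask objective described above, the sketch below combines four per-decoder losses into a single training loss. The abstract does not state how the losses are combined, so the weighted sum, the helper name `combine_losses`, the decoder keys, and all numeric values here are assumptions made for illustration only, not the authors' formulation.

```python
from typing import Dict

# Decoder names follow the abstract; the string keys themselves are hypothetical.
DECODERS = ("ctc", "rnnt", "attention", "mask_predict")


def combine_losses(losses: Dict[str, float], weights: Dict[str, float]) -> float:
    """Return a weighted sum of per-decoder losses.

    Assumption: a plain weighted sum, a common multitask formulation.
    The abstract does not specify the actual combination rule.
    """
    missing = [name for name in DECODERS if name not in losses or name not in weights]
    if missing:
        raise KeyError(f"missing loss or weight for: {missing}")
    return sum(weights[name] * losses[name] for name in DECODERS)


# Illustrative numbers only; they are not results from the paper.
example_losses = {"ctc": 2.0, "rnnt": 1.5, "attention": 1.0, "mask_predict": 3.0}
equal_weights = {name: 0.25 for name in DECODERS}
total = combine_losses(example_losses, equal_weights)
```

In a real system each entry of `example_losses` would come from a decoder head running on the shared encoder output, and the weights would be tuned or scheduled, for instance across the two training stages the abstract mentions.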