In this paper, we explore several new schemes for training a seq2seq model to integrate a pre-trained language model (LM). Our proposed fusion methods focus on the memory cell state and the hidden state of the long short-term memory (LSTM) in the seq2seq decoder, and, unlike prior studies, the memory cell state is updated by the LM. This means that the memory retained by the main seq2seq model is adjusted by the external LM. These fusion methods have several variants, depending on the architecture of the memory cell update and on the use of the memory cell and hidden states, which directly affects the final label inference. We performed experiments to show the effectiveness of the proposed methods in a monolingual ASR setup on the Librispeech corpus and in a transfer learning setup from a multilingual ASR (MLASR) base model to a low-resource language. On Librispeech, with multi-level decoding, our best model improved WER relatively by 3.7% and 2.4% on test-clean and test-other, respectively, over the shallow fusion baseline. In transfer learning from an MLASR base model to the IARPA Babel Swahili model, the best scheme improved the transferred model on the eval set relatively by 9.9% and 9.8% in CER and WER, respectively, over the 2-stage transfer baseline.
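The core idea above, letting the external LM write into the decoder LSTM's memory cell rather than only combining output probabilities, can be sketched as a gated update. This is a minimal illustrative sketch, not the paper's exact equations: the gating form, weight names (`W_g`, `b_g`, `W_p`), and dimensions are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_cell_state(c_dec, h_lm, W_g, b_g, W_p):
    """Update the seq2seq decoder's LSTM memory cell with LM information.

    c_dec : (d,)  decoder memory cell state at the current step
    h_lm  : (m,)  hidden state of the pre-trained LM
    A learned sigmoid gate decides, per dimension, how much projected
    LM information is added to the cell. All parameters here are
    hypothetical placeholders for illustration.
    """
    g = sigmoid(W_g @ np.concatenate([c_dec, h_lm]) + b_g)  # fusion gate in (0, 1)
    c_fused = c_dec + g * (W_p @ h_lm)  # gated LM contribution to the memory
    return c_fused

# Toy dimensions for a single decoding step.
d, m = 4, 3
c_dec = rng.standard_normal(d)
h_lm = rng.standard_normal(m)
W_g = rng.standard_normal((d, d + m))
b_g = np.zeros(d)
W_p = rng.standard_normal((d, m))

c_fused = fuse_cell_state(c_dec, h_lm, W_g, b_g, W_p)
```

The fused cell `c_fused` would then replace the ordinary cell state inside the decoder LSTM recurrence, so the adjustment persists in the decoder's memory across steps; the variants mentioned in the abstract differ in where this update is applied and in which of the cell/hidden states feed the final label inference.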