Conformers have recently been proposed as a promising modelling approach for automatic speech recognition (ASR), outperforming recurrent neural network-based approaches and transformers. Nevertheless, the performance of these end-to-end models, especially attention-based ones, degrades markedly on long utterances. To address this limitation, we propose adding a fully differentiable memory-augmented neural network between the encoder and decoder of a conformer. This external memory can improve generalization to longer utterances, since it allows the system to store and retrieve more information recurrently. Specifically, we explore the neural Turing machine (NTM), which yields our proposed Conformer-NTM model architecture for ASR. Experimental results using the Librispeech train-clean-100 and train-960 sets show that the proposed system outperforms the baseline conformer without memory on long utterances.
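To make the idea of an external memory that is written to and read from recurrently more concrete, the following is a minimal sketch of NTM-style content-based addressing with erase-then-add writes, applied frame by frame to a sequence of encoder outputs. It is an illustration under stated assumptions, not the Conformer-NTM implementation: all function names are hypothetical, each frame is used directly as its own key and add vector, the erase gate is a fixed constant, and the read vector is simply added to the frame. In an actual NTM a learned controller would produce these quantities, and location-based addressing (interpolation, shift, sharpening) would also be available; both are omitted here.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def content_weights(memory, key, beta):
    """Content-based addressing: softmax over beta-scaled cosine similarity."""
    key_norm = math.sqrt(sum(k * k for k in key))
    sims = []
    for row in memory:
        row_norm = math.sqrt(sum(v * v for v in row))
        dot = sum(v * k for v, k in zip(row, key))
        sims.append(dot / (row_norm * key_norm + 1e-8))
    return softmax([beta * s for s in sims])


def ntm_read(memory, w):
    """Read vector: weighted sum of memory rows."""
    dim = len(memory[0])
    return [sum(w[i] * memory[i][j] for i in range(len(memory)))
            for j in range(dim)]


def ntm_write(memory, w, erase, add):
    """Erase-then-add write; returns a new memory matrix."""
    return [[row[j] * (1.0 - w[i] * erase[j]) + w[i] * add[j]
             for j in range(len(row))]
            for i, row in enumerate(memory)]


def augment_encoder_output(frames, n_slots=4, beta=5.0, seed=0):
    """Hypothetical glue code: pass encoder frames through the memory.

    Assumptions (not taken from the paper): the frame serves as key and
    add vector, the erase gate is fixed at 0.5, and the read vector is
    added to the frame before it would be handed to the decoder.
    """
    rng = random.Random(seed)
    dim = len(frames[0])
    memory = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
              for _ in range(n_slots)]
    out = []
    for frame in frames:
        w = content_weights(memory, frame, beta)
        memory = ntm_write(memory, w, [0.5] * dim, frame)
        read = ntm_read(memory, w)
        out.append([f + r for f, r in zip(frame, read)])
    return out
```

Because the memory persists across frames, information written early in an utterance remains retrievable later, which is the property the abstract relies on for long utterances. For example, `augment_encoder_output([[0.1 * t, 1.0, -0.5] for t in range(6)])` returns six memory-augmented frames of the same dimension as the input.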