This paper proposes a self-regularised minimum latency training (SR-MLT) method for streaming Transformer-based automatic speech recognition (ASR) systems. In previous work, latency was optimised by truncating the online attention weights according to hard alignments obtained from conventional ASR models, without accounting for the potential loss of ASR accuracy. In contrast, we present a strategy that obtains the alignments as part of model training, without external supervision. The alignments produced by the proposed method are dynamically regularised on the training data, so that the latency reduction does not come at the cost of ASR accuracy. SR-MLT is applied as a fine-tuning step to pre-trained Transformer models that use either the monotonic chunkwise attention (MoChA) or the cumulative attention (CA) algorithm for online decoding. ASR experiments on the AIShell-1 and LibriSpeech datasets show that, when applied to a reasonably strong pre-trained MoChA or CA baseline model, SR-MLT effectively reduces latency, with relative reductions ranging from 11.8% to 39.5%. Furthermore, we demonstrate that, at certain accuracy levels, models trained with SR-MLT achieve lower latency than those supervised with external hard alignments.