Attention-based encoder-decoder models with autoregressive (AR) decoding have become the dominant approach to automatic speech recognition (ASR) because of their superior accuracy. However, they often suffer from slow inference, primarily because the decoder computes its output incrementally, one token at a time. This work proposes a partially AR framework that employs segment-level vectorized beam search to improve the inference speed of an ASR model based on the hybrid connectionist temporal classification (CTC)/attention architecture. The method first generates an initial hypothesis using greedy CTC decoding and identifies low-confidence tokens based on their output probabilities. It then uses the decoder to perform segment-level vectorized beam search on these tokens, re-predicting them in parallel with minimal decoder computation. Experimental results on the LibriSpeech corpus show that our method achieves 12 to 13 times faster inference than AR decoding while preserving high accuracy.
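The first stage described above (greedy CTC decoding followed by selection of low-confidence tokens) can be illustrated with a minimal sketch. Several details here are assumptions rather than the authors' implementation: the blank index `BLANK = 0`, the `threshold` value, taking a token's confidence to be the highest frame posterior within its run, and grouping adjacent low-confidence tokens into segments. The function names are hypothetical, and the decoder-side segment-level beam search is not shown.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol


def greedy_ctc_with_confidence(posteriors, threshold=0.9):
    """Greedy CTC decoding that also flags low-confidence tokens.

    posteriors: list of per-frame probability lists (frames x vocab).
    Returns a list of (token_id, confidence, is_low_confidence) tuples.
    """
    # Frame-wise argmax together with its probability.
    best = [max(enumerate(frame), key=lambda kv: kv[1]) for frame in posteriors]

    hypothesis = []
    # Collapse consecutive repeats of the same label, then drop blanks.
    for token, run in groupby(best, key=lambda kv: kv[0]):
        if token == BLANK:
            continue
        # Assumption: confidence is the highest frame probability in the run.
        confidence = max(p for _, p in run)
        hypothesis.append((token, confidence, confidence < threshold))
    return hypothesis


def low_confidence_segments(hypothesis):
    """Group adjacent low-confidence positions into (start, end) spans.

    The end index is exclusive. Each span is a candidate for re-prediction
    by the attention decoder.
    """
    segments = []
    start = None
    for i, (_, _, is_low) in enumerate(hypothesis):
        if is_low and start is None:
            start = i
        elif not is_low and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(hypothesis)))
    return segments
```

Under these assumptions, only the positions returned by `low_confidence_segments` would be passed to the decoder, and because the segments are independent of one another they could be batched and re-predicted in parallel.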