Connectionist Temporal Classification (CTC) is a very efficient method for modeling sequences, especially speech data. To use a CTC model for Automatic Speech Recognition (ASR), beam search decoding with an external language model, such as an n-gram LM, is necessary to obtain reasonable results. In this paper we analyze the role of the blank label in CTC beam search in depth and propose a very simple method that reduces the amount of computation, resulting in faster beam search decoding. With this method, decoding is up to 78% faster than ordinary beam search decoding, with a very small loss of accuracy on the LibriSpeech datasets. We show that the method is effective both empirically, through experiments, and theoretically, through mathematical reasoning. We also observe that the reduction is more pronounced when the model is more accurate.