Standard decoding algorithms for Recurrent Neural Network Transducers (RNN-T) in speech recognition iterate over the time axis, fully decoding one time step before moving on to the next. These algorithms make a large number of calls to the joint network, which previous work has shown to be an important factor in reducing decoding speed. We present a beam search decoding algorithm that batches the joint network calls across a segment of time steps, yielding decoding speedups of 20%-96% consistently across all models and settings we experimented with. In addition, aggregating emission probabilities over a segment may be seen as a better approximation to finding the most likely model output: as the segment size increases, our algorithm improves oracle word error rate by up to 11% relative and slightly improves general word error rate.
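To make the effect of batching on the number of joint network calls concrete, the following is a minimal sketch, not the beam search algorithm itself. `ToyJoint`, `per_frame_scores`, `segment_scores`, and the toy inputs are all hypothetical names and values introduced for illustration. The sketch also assumes a single, fixed prediction-network state, whereas a real decoder updates that state as labels are emitted.

```python
import math


class ToyJoint:
    """Hypothetical stand-in for an RNN-T joint network that counts its calls."""

    def __init__(self):
        self.calls = 0

    def __call__(self, enc_frames, pred_state):
        # One invocation scores every frame it is given against one
        # prediction-network state and returns per-frame log-probabilities.
        self.calls += 1
        out = []
        for frame in enc_frames:
            logits = [f + p for f, p in zip(frame, pred_state)]
            log_z = math.log(sum(math.exp(x) for x in logits))
            out.append([x - log_z for x in logits])
        return out


def per_frame_scores(joint, enc, pred_state):
    # Time-synchronous style: one joint call per time step.
    return [joint([frame], pred_state)[0] for frame in enc]


def segment_scores(joint, enc, pred_state, segment_size):
    # Segment style: one joint call covers segment_size consecutive time steps.
    scores = []
    for start in range(0, len(enc), segment_size):
        scores.extend(joint(enc[start:start + segment_size], pred_state))
    return scores


# Toy inputs (assumptions): 8 encoder frames over a 3-symbol vocabulary.
enc = [[0.1 * t, 0.2, -0.1 * t] for t in range(8)]
pred_state = [0.0, 0.5, -0.5]

frame_joint, segment_joint = ToyJoint(), ToyJoint()
frame_out = per_frame_scores(frame_joint, enc, pred_state)
segment_out = segment_scores(segment_joint, enc, pred_state, segment_size=4)

print(frame_joint.calls, segment_joint.calls)  # prints "8 2"
```

Under these assumptions both variants produce the same scores, while the segment variant issues one call per segment instead of one per frame. On real hardware each batched call can process its frames in parallel, which is where a speedup would come from.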