End-to-end models, especially the Recurrent Neural Network Transducer (RNN-T), have achieved great success in speech recognition. However, the transducer requires a large memory footprint and long computing time when processing a long decoding sequence. To address this problem, we propose the time-sparse transducer, which introduces a time-sparse mechanism into the transducer. In this mechanism, we obtain intermediate representations by reducing the time resolution of the hidden states. A weighted average algorithm then combines these representations into sparse hidden states, which are fed to the decoder. All experiments are conducted on the Mandarin dataset AISHELL-1. The character error rate of the time-sparse transducer is close to that of RNN-T, while its real-time factor is 50.00% of the RNN-T value. By adjusting the time resolution, the time-sparse transducer can further reduce the real-time factor to 16.54% of the original at the expense of a 4.94% loss of precision.
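As a rough illustration of the time-sparse mechanism described above, the following Python sketch collapses each window of consecutive hidden states into a single weighted-average frame. The function name `time_sparse`, the `window` parameter, and the fixed per-position `weights` are assumptions made for illustration only; the abstract does not specify how the windows are formed or how the weights are obtained.

```python
from typing import List, Optional, Sequence

Frame = List[float]


def time_sparse(hidden: Sequence[Frame], window: int,
                weights: Optional[Sequence[float]] = None) -> List[Frame]:
    """Collapse every `window` consecutive frames into one weighted-average frame.

    Hypothetical helper: the fixed, positive, per-position weights are an
    assumption, not the weighting scheme of the original model.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    if weights is None:
        weights = [1.0] * window  # uniform weights reduce to a plain average
    if len(weights) != window:
        raise ValueError("need exactly one weight per position in the window")

    sparse: List[Frame] = []
    for start in range(0, len(hidden), window):
        chunk = hidden[start:start + window]
        w = weights[:len(chunk)]  # the final chunk may be shorter than `window`
        total = sum(w)            # renormalise so the weights in use sum to 1
        dim = len(chunk[0])
        sparse.append([
            sum(wi * frame[d] for wi, frame in zip(w, chunk)) / total
            for d in range(dim)
        ])
    return sparse


# Five 2-dimensional hidden states reduced with a window of two frames.
hidden_states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
sparse_states = time_sparse(hidden_states, window=2)
print(len(hidden_states), "->", len(sparse_states))  # prints "5 -> 3"
```

With `window=2` the decoder would see roughly half as many frames, and a larger window shortens the sequence further, which mirrors the trade-off between speed and accuracy reported above. The sketch is not meant to reproduce the reported real-time factors.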