RNN-T models are widely used in ASR and rely on the RNN-T loss to align the input audio with the target sequence. However, the implementation complexity of the RNN-T loss and its alignment-based optimization target lead, respectively, to computational redundancy and a diminished role for the predictor network. In this paper, we propose CIF-Transducer (CIF-T), a novel model that incorporates the Continuous Integrate-and-Fire (CIF) mechanism into the RNN-T architecture to achieve efficient alignment. This allows the RNN-T loss to be abandoned, reducing computation and granting the predictor network a more significant role. We also introduce Funnel-CIF, Context Blocks, a Unified Gating and Bilinear Pooling joint network, and an auxiliary training strategy to further improve performance. Experiments on the 178-hour AISHELL-1 and 10000-hour WenetSpeech datasets show that CIF-T achieves state-of-the-art results with lower computational overhead than RNN-T models.
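To make the alignment idea concrete, the core of the CIF mechanism can be sketched as follows: per-frame weights are accumulated until they cross a firing threshold, at which point one integrated embedding is emitted. This is an illustrative sketch under our own assumptions (function name `cif`, NumPy arrays, threshold 1.0), not the paper's implementation.

```python
import numpy as np

def cif(encoder_out, alphas, threshold=1.0):
    """Continuous Integrate-and-Fire sketch.

    encoder_out: (T, D) frame-level acoustic embeddings.
    alphas:      (T,) non-negative per-frame weights.
    Emits one weighted sum of frames each time the accumulated
    weight reaches `threshold`; leftover weight carries forward.
    """
    fired = []
    acc = 0.0                                    # accumulated weight so far
    frame = np.zeros(encoder_out.shape[1])       # partially integrated embedding
    for h, a in zip(encoder_out, alphas):
        if acc + a < threshold:
            acc += a
            frame += a * h                       # keep integrating this frame
        else:
            r = threshold - acc                  # weight needed to fire now
            fired.append(frame + r * h)          # emit one integrated embedding
            acc = a - r                          # remainder starts the next one
            frame = acc * h
    if fired:
        return np.stack(fired)
    return np.empty((0, encoder_out.shape[1]))
```

Because each emitted vector integrates exactly `threshold` worth of weight, the number of fired embeddings tracks the target-sequence length rather than the number of acoustic frames, which is what replaces the RNN-T loss's frame-to-label alignment.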