Transducer models trained with a sequence-level criterion require a large amount of memory because they must generate a large probability matrix. We propose a lightweight transducer model based on a frame-level criterion, which uses the results of the CTC forced alignment algorithm to determine the label for each frame. The encoder output can then be combined with the decoder output at the corresponding time step, rather than adding every element of the encoder output to every element of the decoder output as in the standard transducer. This significantly reduces memory and computation requirements. To address the classification imbalance caused by the excess of blanks in the labels, we decouple the blank and non-blank probabilities and truncate the gradient from the blank classifier to the main network. Experiments on AISHELL-1 show that this enables the lightweight transducer to achieve results similar to those of the transducer. Additionally, we use richer information to predict the blank probability, achieving results superior to those of the transducer.
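As a rough illustration of the frame-level combination described in the abstract, the following sketch contrasts the standard transducer joint, which pairs every encoder frame with every decoder state, against a joint that selects one decoder state per frame from an alignment. This is not the authors' implementation. The function names, the blank id of 0, the use of plain addition as the joint, and the assumption that the CTC alignment has already been reduced to one non-blank label per emitted token (with all other frames blank) are all hypothetical choices made for this example.

```python
BLANK = 0  # assumed blank id; the abstract does not specify one


def decoder_index_per_frame(alignment, blank=BLANK):
    """Return, for each frame, how many non-blank labels were emitted before it.

    Assumes `alignment` holds one label per frame and that each token
    appears as non-blank on exactly one frame (an assumption of this sketch).
    """
    emitted = 0
    indices = []
    for label in alignment:
        indices.append(emitted)
        if label != blank:
            emitted += 1
    return indices


def full_joint(enc, dec):
    """Standard transducer joint: T x (U+1) vectors, one per (frame, state) pair."""
    return [[[e + d for e, d in zip(enc_t, dec_u)] for dec_u in dec] for enc_t in enc]


def frame_level_joint(enc, dec, alignment):
    """Frame-level joint: T vectors, each frame paired with a single decoder state."""
    indices = decoder_index_per_frame(alignment)
    return [[e + d for e, d in zip(enc[t], dec[u])] for t, u in enumerate(indices)]


# Toy inputs: 4 encoder frames, 3 decoder states (start state plus 2 tokens).
enc = [[1, 1], [2, 2], [3, 3], [4, 4]]
dec = [[10, 0], [20, 0], [30, 0]]
alignment = [0, 5, 0, 7]  # blank, token 5, blank, token 7

lattice = full_joint(enc, dec)
frames = frame_level_joint(enc, dec, alignment)
print(len(lattice) * len(lattice[0]), "vectors in the full joint")
print(len(frames), "vectors in the frame-level joint")
```

In this toy case the full joint holds 12 vectors while the frame-level joint holds 4, and each frame-level vector is one entry of the full lattice. The gap grows with the label sequence length, which is the memory saving the abstract refers to.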