Connectionist temporal classification (CTC) models are widely used because of their simple structure, strong performance, and fast inference. The frame-level probability distribution predicted by a CTC model contains many peaks, each of which represents a non-blank token. The recognition latency of CTC models can be reduced by encouraging the model to predict these peaks earlier. Existing latency-reduction methods require modifying both the transition relationships between tokens in the forward-backward algorithm and the gradient calculation, and some of them also depend on forced alignments provided by other pretrained models. As a result, these methods are complex to implement. To reduce peak latency, we propose a simple and novel method named peak-first regularization, which uses a frame-wise knowledge distillation function to force the probability distribution of the CTC model to shift left along the time axis, rather than directly modifying the computation of the CTC loss and its gradients. All experiments are conducted on AISHELL-1, a Mandarin Chinese dataset. We verify the effectiveness of the proposed regularization on both streaming and non-streaming CTC models. The results show that the proposed method reduces the average peak latency by about 100 to 200 milliseconds with almost no degradation in recognition accuracy.
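To make the idea of a frame-wise distillation term more concrete, the following is a minimal sketch and not the exact loss used in this work. It assumes that each frame t acts as a student and the following frame t+1 acts as a teacher, and that the two are compared with a KL divergence; under that assumption, minimizing the term pulls the probability mass of later frames toward earlier ones. The function names `softmax_log`, `peak_first_penalty`, and `peak_frames`, the blank index 0, and the weight `lam` mentioned afterwards are all hypothetical.

```python
import math


def softmax_log(logits):
    """Return log-probabilities for one frame of raw scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def peak_first_penalty(log_probs):
    """Frame-wise distillation penalty (assumed form, not the paper's exact loss).

    log_probs: list of T frames, each a list of V log-probabilities.
    Frame t (student) is pulled toward frame t+1 (teacher) through
    KL(p_{t+1} || p_t), averaged over the T-1 adjacent frame pairs.
    In an autograd framework the teacher term would typically be detached
    so that gradients reach only the earlier frame.
    """
    if len(log_probs) < 2:
        return 0.0
    total = 0.0
    for student, teacher in zip(log_probs[:-1], log_probs[1:]):
        total += sum(math.exp(lt) * (lt - ls) for ls, lt in zip(student, teacher))
    return total / (len(log_probs) - 1)


def peak_frames(log_probs, blank=0):
    """Indices of frames whose most likely token is not the blank symbol."""
    return [
        t
        for t, frame in enumerate(log_probs)
        if max(range(len(frame)), key=frame.__getitem__) != blank
    ]


# Toy example: three blank-dominated frames followed by one non-blank peak.
late = [softmax_log(f) for f in [[5.0, 0.0, 0.0]] * 3 + [[0.0, 5.0, 0.0]]]
penalty = peak_first_penalty(late)  # positive, because the last two frames differ
peaks = peak_frames(late)           # the single peak sits at the final frame
```

In training, such a term would be added to the unmodified CTC loss, for example as `ctc_loss + lam * penalty`, so the forward-backward computation itself stays untouched; the weight `lam` is an illustrative hyperparameter, not a value taken from this work.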