Connectionist Temporal Classification (CTC) is a widely used method for automatic speech recognition (ASR), renowned for its simplicity and computational efficiency. However, it often falls short in recognition performance. In this work, we propose the Consistency-Regularized CTC (CR-CTC), which enforces consistency between two CTC distributions obtained from different augmented views of the input speech mel-spectrogram. We provide in-depth insights into its essential behaviors from three perspectives: 1) it conducts self-distillation between random pairs of sub-models that process different augmented views; 2) it learns contextual representation through masked prediction for positions within time-masked regions, especially when we increase the amount of time masking; 3) it suppresses the extremely peaky CTC distributions, thereby reducing overfitting and improving the generalization ability. Extensive experiments on LibriSpeech, Aishell-1, and GigaSpeech datasets demonstrate the effectiveness of our CR-CTC. It significantly improves the CTC performance, achieving state-of-the-art results comparable to those attained by transducer or systems combining CTC and attention-based encoder-decoder (CTC/AED). We release our code at https://github.com/k2-fsa/icefall.
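The core idea above — enforcing consistency between two CTC distributions computed from different augmented views — can be sketched as a combined loss. This is a minimal illustration, not the authors' implementation (see the icefall repository for that); the function name, the `alpha` weight, and the use of a symmetric KL divergence at the frame level are assumptions for exposition.

```python
# Hedged sketch of a CR-CTC-style loss in PyTorch. All names and the
# symmetric-KL formulation are illustrative assumptions, not the exact
# recipe from the paper's code.
import torch
import torch.nn.functional as F


def cr_ctc_loss(log_probs_a, log_probs_b, targets,
                input_lengths, target_lengths, alpha=0.2):
    """log_probs_a/b: (T, N, C) log-softmax outputs of the same model
    on two differently augmented views of the input spectrogram."""
    # Standard CTC loss on each view.
    ctc_a = F.ctc_loss(log_probs_a, targets, input_lengths, target_lengths)
    ctc_b = F.ctc_loss(log_probs_b, targets, input_lengths, target_lengths)
    # Symmetric frame-level KL divergence between the two distributions
    # (the consistency-regularization term).
    kl_ab = F.kl_div(log_probs_a, log_probs_b, log_target=True,
                     reduction="batchmean")
    kl_ba = F.kl_div(log_probs_b, log_probs_a, log_target=True,
                     reduction="batchmean")
    consistency = 0.5 * (kl_ab + kl_ba)
    return 0.5 * (ctc_a + ctc_b) + alpha * consistency


# Tiny demo with random encoder outputs standing in for two augmented views.
T, N, C, S = 50, 2, 10, 12
torch.manual_seed(0)
view_a = torch.randn(T, N, C).log_softmax(-1)
view_b = torch.randn(T, N, C).log_softmax(-1)
targets = torch.randint(1, C, (N, S))
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

loss = cr_ctc_loss(view_a, view_b, targets, input_lengths, target_lengths)
# With identical views the consistency term vanishes, so the loss reduces
# to the plain averaged CTC loss.
loss_same = cr_ctc_loss(view_a, view_a, targets, input_lengths, target_lengths)
ctc_only = F.ctc_loss(view_a, targets, input_lengths, target_lengths)
```

In practice the two views would come from independently sampled SpecAugment masks on the same utterance; with heavier time masking, the consistency term pushes the model toward masked-prediction-style contextual learning, as described above.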