In streaming settings, speech recognition models must map sub-sequences of speech to text before the full audio stream becomes available. However, since alignment information between speech and text is rarely available during training, models need to learn it in a completely self-supervised way. In practice, the exponential number of possible alignments makes this extremely challenging, and models often learn peaky or sub-optimal alignments. Prima facie, the exponential size of the alignment space makes it difficult even to quantify the uncertainty of a model's alignment distribution. Fortunately, it has been known for decades that the entropy of a probabilistic finite state transducer can be computed in time linear in the size of the transducer via a dynamic programming reduction based on semirings. In this work, we revisit the entropy semiring for neural speech recognition models and show how alignment entropy can be used to supervise models through regularization or distillation. We also contribute an open-source implementation of CTC and RNN-T in the semiring framework that includes numerically stable and highly parallel variants of the entropy semiring. Empirically, we observe that adding alignment distillation improves both the accuracy and the latency of an already well-optimized teacher-student distillation model, achieving state-of-the-art performance on the LibriSpeech dataset in the streaming scenario.
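To make the dynamic programming reduction mentioned in the abstract concrete, the following is a minimal sketch of the entropy semiring on a toy alignment lattice. It is not the paper's implementation: the function names (`s_plus`, `s_times`, `lift`, `lattice_entropy`, `brute_force_entropy`) and the lattice encoding as `(src, dst, weight)` triples are assumptions made for illustration, and the sketch works directly in the probability domain rather than in the numerically stable log-domain variant the abstract refers to. Each arc weight w is lifted to the pair (w, -w log w); a single forward pass then yields the total mass Z together with the sum of -p log p over all paths, from which the entropy of the normalized path distribution follows as r / Z + log Z.

```python
import math

# Entropy semiring element: a pair (p, r), where p is accumulated path mass
# and r accumulates -p * log(p) over the paths summarized so far.
ZERO = (0.0, 0.0)  # additive identity
ONE = (1.0, 0.0)   # multiplicative identity


def s_plus(a, b):
    """Semiring addition: merge two disjoint sets of paths."""
    return (a[0] + b[0], a[1] + b[1])


def s_times(a, b):
    """Semiring multiplication: extend paths in a by paths in b."""
    return (a[0] * b[0], a[0] * b[1] + b[0] * a[1])


def lift(w):
    """Map a positive arc weight w to its entropy semiring element."""
    return (w, -w * math.log(w))


def lattice_entropy(arcs, num_states):
    """Entropy of the normalized path distribution of an acyclic lattice.

    Assumptions (hypothetical encoding): states are numbered so that every
    arc (src, dst, weight) has src < dst, state 0 is the start state, and
    state num_states - 1 is the only final state.
    """
    alpha = [ZERO] * num_states
    alpha[0] = ONE
    # Sorting is only a convenience to visit arcs in topological order;
    # with pre-sorted arcs the pass touches each arc exactly once.
    for src, dst, w in sorted(arcs):
        alpha[dst] = s_plus(alpha[dst], s_times(alpha[src], lift(w)))
    z, r = alpha[-1]
    return r / z + math.log(z)


def brute_force_entropy(arcs, num_states):
    """Reference value: enumerate every path explicitly (exponential cost)."""
    path_weights = []

    def walk(state, weight):
        if state == num_states - 1:
            path_weights.append(weight)
            return
        for src, dst, w in arcs:
            if src == state:
                walk(dst, weight * w)

    walk(0, 1.0)
    z = sum(path_weights)
    return -sum((p / z) * math.log(p / z) for p in path_weights)


if __name__ == "__main__":
    # Toy lattice with three start-to-final paths.
    arcs = [(0, 1, 0.6), (0, 2, 0.4), (1, 2, 0.5), (1, 3, 0.5), (2, 3, 1.0)]
    print("semiring   :", lattice_entropy(arcs, 4))
    print("brute force:", brute_force_entropy(arcs, 4))
```

The two printed values should agree up to floating-point error. The forward pass needs only one semiring operation pair per arc, whereas the brute-force reference enumerates every path, which is what becomes infeasible for real CTC or RNN-T lattices.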