In streaming settings, speech recognition models have to map sub-sequences of speech to text before the full audio stream becomes available. However, since alignment information between speech and text is rarely available during training, models need to learn it without any direct supervision. In practice, the exponential number of possible alignments makes this extremely challenging, and models often learn peaky or sub-optimal alignments. Prima facie, the exponential size of the alignment space makes it difficult even to quantify the uncertainty of a model's alignment distribution. Fortunately, it has been known for decades that the entropy of a probabilistic finite state transducer can be computed in time linear in the size of the transducer via a dynamic programming reduction based on semirings. In this work, we revisit the entropy semiring for neural speech recognition models, and show how alignment entropy can be used to supervise models through regularization or distillation. We also contribute an open-source implementation of CTC and RNN-T in the semiring framework that includes numerically stable and highly parallel variants of the entropy semiring. Empirically, we observe that adding alignment distillation improves the accuracy and latency of an already well-optimized teacher-student distillation model, achieving state-of-the-art performance on the LibriSpeech dataset in the streaming scenario.
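To make the semiring reduction concrete, the following is a minimal sketch of the classical entropy (expectation) semiring on a toy alignment lattice. It is not the open-source implementation described above: it works in the probability domain rather than in a numerically stable log-domain variant, and the lattice `ARCS`, the topological node numbering, and all function names are assumptions made for illustration. Each arc with probability p carries the pair (p, -p log p), and a single forward pass accumulates the total path mass Z together with the sum of -p(π) log p(π) over all paths π, from which the entropy of the normalized path distribution follows.

```python
import math

# Elements of the entropy semiring are pairs (p, r):
#   p accumulates path probability, r accumulates -p * log(p).
ZERO = (0.0, 0.0)  # additive identity
ONE = (1.0, 0.0)   # multiplicative identity


def plus(a, b):
    """Semiring addition: merge two sets of paths."""
    return (a[0] + b[0], a[1] + b[1])


def times(a, b):
    """Semiring multiplication: extend paths (product rule on r)."""
    return (a[0] * b[0], a[0] * b[1] + b[0] * a[1])


def arc_weight(p):
    """Lift an arc probability p > 0 into the semiring."""
    return (p, -p * math.log(p))


def semiring_entropy(num_nodes, arcs):
    """Entropy of the path distribution from node 0 to node num_nodes - 1.

    Assumes nodes are numbered in topological order (src < dst for every
    arc), so one pass over the arcs sorted by source is a valid forward
    recursion. Cost is linear in the number of arcs, apart from the sort.
    """
    alpha = [ZERO] * num_nodes
    alpha[0] = ONE
    for src, dst, p in sorted(arcs):
        alpha[dst] = plus(alpha[dst], times(alpha[src], arc_weight(p)))
    z, r = alpha[-1]
    # H = -sum_pi (p_pi / Z) log(p_pi / Z) = r / Z + log Z
    return r / z + math.log(z)


def brute_force_entropy(num_nodes, arcs):
    """Reference value obtained by enumerating every path explicitly."""
    path_probs = []

    def walk(node, prob):
        if node == num_nodes - 1:
            path_probs.append(prob)
            return
        for src, dst, p in arcs:
            if src == node:
                walk(dst, prob * p)

    walk(0, 1.0)
    z = sum(path_probs)
    return -sum((q / z) * math.log(q / z) for q in path_probs)


# Hypothetical toy lattice: (source, destination, arc probability).
ARCS = [(0, 1, 0.6), (0, 2, 0.4), (1, 2, 0.5), (1, 3, 0.5), (2, 3, 1.0)]

if __name__ == "__main__":
    print("semiring   :", semiring_entropy(4, ARCS))
    print("brute force:", brute_force_entropy(4, ARCS))
```

The two functions agree on this lattice, but only the brute-force one enumerates paths; the semiring version touches each arc once, which is what makes the computation tractable when the number of alignments grows exponentially with sequence length.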