Current time-synchronous sequence-to-sequence automatic speech recognition (ASR) models are trained with a sequence-level cross-entropy criterion that sums over all alignments. Due to the discriminative formulation, incorporating the right label context into the gradient of the training criterion causes normalization problems and is not mathematically well-defined. The classic hybrid neural network hidden Markov model (NN-HMM), with its inherent generative formulation, enables conditioning on the right label context. However, due to HMM state-tying, the identity of the right label context is never modeled explicitly. In this work, we propose a factored loss with auxiliary left and right label contexts that sums over all alignments. We show that including the right label context is particularly beneficial when training data is limited. Moreover, we show that a factored hybrid HMM system can be built by relying exclusively on the full-sum criterion. Experiments were conducted on Switchboard 300h and LibriSpeech 960h.
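To make the "sum over all alignments" concrete: a full-sum criterion can be computed with the forward algorithm over a monotonic HMM topology (each label may loop or advance one state per frame). The sketch below is an illustrative toy implementation under these assumptions, not the paper's actual training code; `log_probs` stands for per-frame label log-posteriors from some acoustic model.

```python
import math

def full_sum_log_prob(log_probs, labels):
    """Log of the sum over all monotonic alignments of `labels` to the frames.

    log_probs[t][k] : log P(label k | frame t)  (hypothetical model scores)
    labels          : target label sequence as integer indices
    """
    T, S = len(log_probs), len(labels)
    NEG_INF = float("-inf")
    # alpha[s]: log-sum of all partial alignments that are in state s at frame t
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][labels[0]]  # must start in the first label state
    for t in range(1, T):
        new = [NEG_INF] * S
        for s in range(S):
            stay = alpha[s]                             # loop in the same state
            step = alpha[s - 1] if s > 0 else NEG_INF   # advance one label
            best = max(stay, step)
            if best > NEG_INF:
                # log-sum-exp of the two incoming paths, then emit the frame
                acc = best + math.log(math.exp(stay - best) + math.exp(step - best))
                new[s] = acc + log_probs[t][labels[s]]
        alpha = new
    return alpha[S - 1]  # must end in the last label state
```

The same recursion, run with gradients through the acoustic-model scores, yields the gradient of the full-sum cross-entropy; real systems compute it batched in log-space on the GPU rather than with Python loops.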