Current time-synchronous sequence-to-sequence automatic speech recognition (ASR) models are trained with a sequence-level cross-entropy criterion that sums over all alignments. Due to the discriminative formulation, incorporating the right label context into the training criterion's gradient causes normalization problems and is not mathematically well-defined. The classic hybrid neural network hidden Markov model (NN-HMM), with its inherent generative formulation, enables conditioning on the right label context. However, due to HMM state-tying, the identity of the right label context is never modeled explicitly. In this work, we propose a factored loss with auxiliary left and right label contexts that sums over all alignments. We show that the inclusion of the right label context is particularly beneficial when training data resources are limited. Moreover, we show that it is possible to build a factored hybrid HMM system by relying exclusively on the full-sum criterion. Experiments were conducted on Switchboard 300h and LibriSpeech 960h.
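To make the "sum over all alignments" notion concrete, the following is a minimal sketch of a generic full-sum log-probability computed with the forward algorithm over a simple HMM topology (each state may loop or advance). This is not the factored criterion with auxiliary left/right label contexts proposed in the work; the function name, the loop/forward topology, and the use of frame-wise label scores are illustrative assumptions.

```python
import numpy as np

def logsumexp2(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    m = np.maximum(a, b)
    return m + np.log(np.exp(a - m) + np.exp(b - m))

def full_sum_log_prob(log_probs, labels):
    """Illustrative full-sum criterion (not the paper's factored loss).

    Sums over all monotonic alignments of `labels` to T frames under a
    simple HMM topology where each state may loop or advance by one.

    log_probs: [T, V] array of frame-wise label log-scores.
    labels:    label index sequence of length S <= T.
    Returns log sum over alignments of the product of frame scores.
    """
    T, _ = log_probs.shape
    S = len(labels)
    NEG_INF = -1e30
    # alpha[s] = log-sum of all partial alignments ending in state s.
    alpha = np.full(S, NEG_INF)
    alpha[0] = log_probs[0, labels[0]]  # must start in the first state
    for t in range(1, T):
        new_alpha = np.full(S, NEG_INF)
        for s in range(S):
            stay = alpha[s]                           # loop transition
            advance = alpha[s - 1] if s > 0 else NEG_INF  # forward transition
            new_alpha[s] = logsumexp2(stay, advance) + log_probs[t, labels[s]]
        alpha = new_alpha
    # All valid alignments must end in the final state.
    return alpha[S - 1]
```

For example, with T = 3 frames, labels [0, 1], and uniform frame scores of 0.5, there are exactly two alignments ([0, 0, 1] and [0, 1, 1]), so the full-sum probability is 2 * 0.5^3 = 0.25. Differentiating this log-probability with respect to the frame scores yields the gradient used in full-sum training.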