In this work, we investigate how language models (LMs) with different context lengths and label units (phoneme vs. word) affect sequence discriminative training for phoneme-based neural transducers. We examine both lattice-free and N-best-list approaches. For lattice-free methods with phoneme-level LMs, we propose a method to approximate the context history so that LMs with full-context dependency can be employed. This approximation extends to arbitrary context lengths and enables the use of word-level LMs in lattice-free methods. We further conduct a systematic comparison of lattice-free and N-best-list-based methods. Experimental results on Librispeech show that using a word-level LM in training outperforms using a phoneme-level LM. We also find that the context size of the LM used for probability computation has a limited effect on performance. Finally, our results reveal the pivotal importance of hypothesis space quality in sequence discriminative training.
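To make the N-best-list setting more concrete, the following is a minimal sketch of how an LM score could enter a sequence discriminative criterion over an N-best list. It assumes a maximum mutual information (MMI) style objective, which the abstract does not name, and it is not the authors' implementation. The function names `log_sum_exp` and `nbest_mmi_loss`, the `lm_scale` parameter, and all scores are hypothetical and chosen only for illustration.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def nbest_mmi_loss(model_scores, lm_scores, ref_index, lm_scale=1.0):
    """MMI-style loss over an N-best list (illustrative assumption).

    model_scores: log-scores of each hypothesis from the transducer.
    lm_scores:    log-probabilities of each hypothesis from the training LM
                  (phoneme-level or word-level; the criterion is agnostic).
    ref_index:    position of the reference transcription in the list.
    lm_scale:     hypothetical weight applied to the LM log-probabilities.
    """
    # Combine model and LM evidence for every hypothesis in the list.
    combined = [m + lm_scale * l for m, l in zip(model_scores, lm_scores)]
    # Negative log posterior of the reference within the hypothesis space.
    return log_sum_exp(combined) - combined[ref_index]


# Hypothetical 3-best list with the reference at position 0.
loss = nbest_mmi_loss([-12.0, -12.5, -14.0], [-8.0, -9.5, -7.0], ref_index=0)
```

In this sketch the N-best list plays the role of the hypothesis space: the loss is always non-negative and shrinks as the reference dominates the competing hypotheses. The competitors therefore determine how informative the training signal is, whichever LM supplies `lm_scores`.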