External language models (LMs) are often incorporated into the decoding stage of automatic speech recognition (ASR) systems, but these models usually operate with limited context. Cross-utterance information has been shown to be beneficial during second-pass rescoring; however, rescoring restricts the hypothesis space to what the first-pass LM produces from local information alone. In this work, we investigate incorporating long-context transformer LMs into cross-utterance decoding of acoustic models via beam search, and compare the results against n-best rescoring. The results demonstrate that beam search makes better use of cross-utterance context. On the long-format dataset AMI, we observe absolute error-rate reductions of 0.7\% and 0.3\% on the dev and test sets, respectively, compared with the single-utterance setting, with improvements when including up to 500 tokens of prior context. We also report evaluations on Tedlium-1, where the improvements are smaller, at around 0.1\% absolute.
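As a rough illustration of the cross-utterance setup described above, the following minimal sketch keeps a rolling buffer of tokens from earlier utterances, capped at 500 tokens, and uses it to condition an LM score during n-best rescoring (the baseline the abstract compares against, not the beam-search integration itself). All names here (`CrossUtteranceContext`, `rescore_nbest`, `toy_lm_score`, `lm_weight`) are hypothetical, and the toy scorer is a stand-in for a real transformer LM; none of this is taken from the paper's implementation.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence, Tuple

# Assumption: cap chosen to mirror the "up to 500 tokens of prior context" in the abstract.
MAX_CONTEXT_TOKENS = 500

Hypothesis = Tuple[List[str], float]  # (tokens, first-pass score), hypothetical layout


class CrossUtteranceContext:
    """Rolling buffer holding the most recent tokens of previously decoded utterances."""

    def __init__(self, max_tokens: int = MAX_CONTEXT_TOKENS) -> None:
        # A deque with maxlen silently drops the oldest tokens once the cap is reached.
        self._buffer: Deque[str] = deque(maxlen=max_tokens)

    def extend(self, tokens: Sequence[str]) -> None:
        self._buffer.extend(tokens)

    def tokens(self) -> List[str]:
        return list(self._buffer)


def toy_lm_score(context: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Toy stand-in for an LM score: counts hypothesis tokens already seen in the context."""
    seen = set(context)
    return float(sum(1 for token in hypothesis if token in seen))


def rescore_nbest(
    nbest: Sequence[Hypothesis],
    context: Sequence[str],
    lm_score: Callable[[Sequence[str], Sequence[str]], float],
    lm_weight: float = 0.5,
) -> List[str]:
    """Pick the hypothesis with the best first-pass score plus weighted, context-conditioned LM score."""
    best_tokens, _ = max(
        nbest, key=lambda item: item[1] + lm_weight * lm_score(context, item[0])
    )
    return best_tokens


if __name__ == "__main__":
    history = CrossUtteranceContext()
    history.extend(["their", "system"])  # tokens decoded from an earlier utterance
    candidates = [(["their", "model"], -1.0), (["there", "model"], -0.9)]
    print(rescore_nbest(candidates, history.tokens(), toy_lm_score))
    history.extend(["their", "model"])  # carry the chosen output forward as context
```

Because rescoring can only reorder the candidates it is given, a hypothesis pruned in the first pass can never be recovered; applying the same context-conditioned score inside beam search removes that restriction, which is the comparison the abstract draws.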