In this work, we study the impact of Large-scale Language Models (LLMs) on Automated Speech Recognition (ASR) of YouTube videos, which we use as a source for long-form ASR. We demonstrate up to 8\% relative reduction in Word Error Rate (WER) on US English (en-us) and code-switched Indian English (en-in) long-form ASR test sets, and up to 30\% relative reduction in Salient Term Error Rate (STER), over a strong first-pass baseline that uses a maximum-entropy language model. Two techniques yield significant gains when rescoring with LLMs: improved lattice processing, which produces a lattice with a proper (non-tree) digraph topology, and carrying context from the 1-best hypothesis of the previous segment(s). We also find that the gains from combining LLMs trained on vast quantities of available data (such as C4) with conventional neural LMs are additive, and that the combination significantly outperforms a strong first-pass baseline with a maximum-entropy LM.

Copyright 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
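To make the idea of context carry-over and additive LM combination concrete, the following is a minimal sketch, not the method used in this work. It simplifies lattice rescoring to n-best rescoring, and every name in it (`rescore_segments`, the `LMScorer` interface, the interpolation weights) is a hypothetical stand-in introduced only for illustration. Each segment's hypotheses are rescored by adding weighted LM log-probabilities to the first-pass score, and the chosen 1-best text is appended to the context used for the next segment.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical LM interface (an assumption, not a real library API):
# takes (context, hypothesis_text) and returns log P(hypothesis_text | context).
LMScorer = Callable[[str, str], float]


def rescore_segments(
    segments: Sequence[List[Tuple[str, float]]],
    lms: Sequence[Tuple[LMScorer, float]],
) -> List[str]:
    """Pick the best hypothesis per segment, carrying 1-best text forward.

    segments: one n-best list per segment; each entry is
              (hypothesis_text, first_pass_log_score).
    lms:      (scorer, weight) pairs; their weighted log-scores are added
              to the first-pass score (log-linear combination).
    """
    context = ""  # 1-best text of all previous segments
    best_per_segment: List[str] = []
    for nbest in segments:
        def total(item: Tuple[str, float]) -> float:
            text, first_pass = item
            return first_pass + sum(w * lm(context, text) for lm, w in lms)

        best_text, _ = max(nbest, key=total)
        best_per_segment.append(best_text)
        # Carry the chosen hypothesis into the context for the next segment.
        context = (context + " " + best_text).strip()
    return best_per_segment
```

In this sketch, passing both an LLM scorer and a conventional neural LM scorer in `lms` corresponds to combining the two models; a real system would operate on the lattice itself rather than on an n-best list and would tune the weights on held-out data.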