Error correction models form an important part of Automatic Speech Recognition (ASR) post-processing, improving the readability and quality of transcriptions. Most prior work uses the 1-best ASR hypothesis as input and can therefore only perform correction by leveraging the context within one sentence. In this work, we propose a novel N-best T5 model for this task, which is fine-tuned from a T5 model and uses ASR N-best lists as model input. By transferring knowledge from the pre-trained language model and obtaining richer information from the ASR decoding space, the proposed approach outperforms a strong Conformer-Transducer baseline. Another issue with standard error correction is that the generation process is not well guided. To address this, a constrained decoding process, based on either the N-best list or an ASR lattice, is used, which allows additional information to be propagated.
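To make the two ideas above concrete, the following is a minimal sketch of how an N-best list might be packed into a single model input and how decoding might be restricted to that list. It is illustrative only and not the implementation described in this work: the separator string `SEP`, the helper names `build_nbest_input` and `constrained_pick`, and the word-frequency scorer `toy_score` are all assumptions. In practice `toy_score` would be replaced by the log-likelihood that the fine-tuned model assigns to each candidate.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical separator between hypotheses; the actual input format is an assumption.
SEP = " <sep> "


def build_nbest_input(hypotheses: Sequence[str]) -> str:
    """Concatenate ASR hypotheses (best first) into one input string."""
    return SEP.join(h.strip() for h in hypotheses)


def constrained_pick(
    hypotheses: Sequence[str],
    score_fn: Callable[[str, str], float],
) -> str:
    """Restrict the output space to the N-best list.

    Every hypothesis is scored against the concatenated N-best input,
    and the highest-scoring one is returned.
    """
    source = build_nbest_input(hypotheses)
    return max(hypotheses, key=lambda h: score_fn(source, h))


def toy_score(source: str, candidate: str) -> float:
    """Stand-in for a model score (assumption, not a real model).

    Returns the average number of times each candidate word occurs in
    the source, so words shared by many hypotheses are rewarded.
    """
    counts = Counter(source.split())
    words = candidate.split()
    return sum(counts[w] for w in words) / max(len(words), 1)


if __name__ == "__main__":
    nbest = [
        "the cat sad on the mat",
        "the cat sat on the mat",
        "the cat sat on a mat",
    ]
    print(build_nbest_input(nbest))
    print(constrained_pick(nbest, toy_score))
```

Restricting the output to the N-best list means the corrected transcription can never drift away from what the ASR system actually proposed. A lattice-based constraint would allow a larger set of paths than the N-best list; it is not shown here.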