Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR), which aims to predict the ground-truth transcription from the decoded N-best hypotheses. Thanks to the strong language generation ability of LLMs and the rich information in the N-best list, GER is highly effective at improving ASR results. However, it still suffers from two limitations: 1) LLMs are unaware of the source speech during GER, which may lead to results that are grammatically correct but violate the source speech content; 2) N-best hypotheses usually vary in only a few tokens, so sending all of them for GER is redundant and can confuse the LLM about which tokens to focus on, leading to more miscorrections. In this paper, we propose ClozeGER, a new paradigm for ASR generative error correction. First, we introduce a multimodal LLM (i.e., SpeechGPT) that receives the source speech as an extra input to improve the fidelity of the correction output. Then, we reformat GER as a cloze test with logits calibration to remove the redundancy in the input and simplify GER with clear instructions. Experiments show that ClozeGER outperforms vanilla GER on 9 popular ASR datasets.
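To make the cloze reformulation concrete, the following is a minimal sketch of how an N-best list could be collapsed into a single template with blanks at the positions where the hypotheses disagree. It is not the authors' implementation: the function name `build_cloze`, the `[BLANK_k]` placeholder format, and the assumption that all hypotheses have the same number of whitespace-separated tokens (so no alignment step is needed) are illustrative choices. The speech input and the logits calibration step are not modeled here.

```python
from typing import List, Tuple


def build_cloze(nbest: List[str]) -> Tuple[str, List[List[str]]]:
    """Collapse an N-best list into a cloze template plus per-blank options.

    Assumption (hypothetical simplification): every hypothesis has the same
    number of whitespace tokens, so positions can be compared directly.
    """
    if not nbest:
        raise ValueError("nbest must contain at least one hypothesis")

    tokenized = [hyp.split() for hyp in nbest]
    length = len(tokenized[0])
    if any(len(tokens) != length for tokens in tokenized):
        raise ValueError("this sketch assumes equal-length hypotheses")

    template: List[str] = []
    options: List[List[str]] = []
    for column in zip(*tokenized):
        # dict.fromkeys removes duplicates while keeping first-seen order.
        candidates = list(dict.fromkeys(column))
        if len(candidates) == 1:
            # All hypotheses agree: keep the token as fixed context.
            template.append(candidates[0])
        else:
            # Hypotheses disagree: emit a blank and record the choices.
            options.append(candidates)
            template.append(f"[BLANK_{len(options)}]")
    return " ".join(template), options


nbest = [
    "the cat sat on the mat",
    "the cat sat on the map",
    "the hat sat on the mat",
]
template, options = build_cloze(nbest)
print(template)
print(options)
```

In this toy example only two of the six positions differ across hypotheses, so the model would be asked to choose among a handful of candidates per blank instead of reading three nearly identical sentences. Real N-best lists differ in length, so an actual system would need an alignment step before blanks can be identified.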