Advancements in deep neural networks have allowed automatic speech recognition (ASR) systems to attain human parity on several publicly available clean speech datasets. However, even state-of-the-art ASR systems degrade under adverse conditions, because a well-trained acoustic model is sensitive to variations in the speech domain, e.g., background noise. Intuitively, humans address this issue by relying on their linguistic knowledge: the meaning of an ambiguous spoken term is usually inferred from contextual cues, thereby reducing the dependency on the auditory system. Inspired by this observation, we introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction, where the N-best decoding hypotheses provide informative elements for predicting the true transcription. This approach is a paradigm shift from the traditional language model rescoring strategy, which can only select one candidate hypothesis as the output transcription. The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses and corresponding accurate transcriptions across prevalent speech domains. Given this dataset, we examine three types of LLM-based error correction techniques that use varying amounts of labeled hypotheses-transcription pairs and achieve a significant word error rate (WER) reduction. Experiments show that the proposed technique surpasses the upper bound of traditional re-ranking based methods. More surprisingly, an LLM with a reasonable prompt can use its generative capability to correct even tokens that are missing from the N-best list. We make our results publicly accessible for reproducible pipelines with released pre-trained models, thus providing a new evaluation paradigm for ASR error correction with LLMs.
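To make the re-ranking upper bound concrete, the following is a minimal sketch, not the benchmark's evaluation code. It assumes WER is the word-level edit distance divided by the reference length, and it assumes the upper bound of a re-ranking method is the oracle that always picks the lowest-WER hypothesis in the N-best list. The function names and the example sentences are hypothetical and do not come from HyPoradise.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate of one hypothesis against one reference."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def oracle_nbest_wer(reference: str, nbest: List[str]) -> float:
    """Assumed re-ranking upper bound: the best WER any single candidate reaches."""
    return min(wer(reference, hyp) for hyp in nbest)


# Hypothetical example: every candidate contains exactly one wrong word.
reference = "turn on the kitchen lights"
nbest = [
    "turn on the chicken lights",
    "turn on the kitchen light",
    "turn in the kitchen lights",
]

rerank_bound = oracle_nbest_wer(reference, nbest)  # 0.2: no candidate is fully correct
generated = "turn on the kitchen lights"           # a corrected output that merges candidates
generative_wer = wer(reference, generated)         # 0.0: below the re-ranking bound
```

In this toy case a re-ranker cannot do better than one error in five words, because it must return one of the three candidates unchanged. A generative corrector is free to combine words from different candidates, which is how its output can fall below that bound.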