HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models

Advances in deep neural networks have allowed automatic speech recognition (ASR) systems to attain human parity on several publicly available clean speech datasets. However, even state-of-the-art ASR systems degrade under adverse conditions, because a well-trained acoustic model is sensitive to variations in the speech domain, e.g., background noise. Intuitively, humans address this issue by relying on their linguistic knowledge: the meaning of an ambiguous spoken term is usually inferred from contextual cues, thereby reducing the dependency on the auditory system. Inspired by this observation, we introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction, where N-best decoding hypotheses provide informative elements for true transcription prediction. This approach is a paradigm shift from the traditional language model rescoring strategy, which can only select one candidate hypothesis as the output transcription. The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses and corresponding accurate transcriptions across prevalent speech domains. Given this dataset, we examine three types of LLM-based error correction techniques with varying amounts of labeled hypotheses-transcription pairs, which yield a significant word error rate (WER) reduction. Experimental evidence demonstrates that the proposed technique achieves a breakthrough by surpassing the upper bound of traditional re-ranking based methods. More surprisingly, an LLM with a reasonable prompt and its generative capability can even correct tokens that are missing from the N-best list. We make our results publicly accessible for reproducible pipelines with released pre-trained models, thus providing a new evaluation paradigm for ASR error correction with LLMs.
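
To make the re-ranking upper bound concrete, the following is a minimal Python sketch, not the benchmark's released code. It computes WER with a word-level edit distance, takes the best WER that any selection-only method could reach on an N-best list (the "oracle"), and formats the list as an LLM prompt. The function names, the toy sentences, the prompt wording, and the hand-written `corrected` string (a stand-in for an LLM output) are all assumptions made for illustration.

```python
from typing import List


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two word sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance divided by reference length."""
    ref = reference.split()
    return word_edit_distance(ref, hypothesis.split()) / len(ref)


def oracle_wer(reference: str, nbest: List[str]) -> float:
    """Best WER reachable by picking one hypothesis from the list.

    Any re-ranking method can only select an existing candidate,
    so it cannot do better than this value.
    """
    return min(wer(reference, h) for h in nbest)


def build_prompt(nbest: List[str]) -> str:
    """Hypothetical prompt format; the wording is an assumption."""
    numbered = "\n".join(f"{i}. {h}" for i, h in enumerate(nbest, start=1))
    return ("Below are N-best hypotheses from a speech recognizer. "
            "Write the most likely true transcription.\n" + numbered)


# Toy example: the word "mat" does not occur in any hypothesis.
reference = "the cat sat on the mat"
nbest = [
    "the cat sat on the mad",
    "the cat sat on the map",
    "a cat sat on the mad",
]

# Hand-written stand-in for a generative correction (no model is called).
corrected = "the cat sat on the mat"

print(build_prompt(nbest))
print("oracle WER of re-ranking:", oracle_wer(reference, nbest))
print("WER of generative output:", wer(reference, corrected))
```

In this toy case every hypothesis contains at least one error, so selection alone cannot reach zero WER, while a generated transcription that restores the missing word can. Whether a real LLM produces such an output depends on the model and prompt, which this sketch does not evaluate.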
