HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models

From arXiv: Accepted to NeurIPS 2023, Datasets and Benchmarks Track; 24 pages. Added the first Mandarin and code-switching (zh-cn and en-us) results from LLM-based generative ASR error correction to Table 8 on page 21.

Advancements in deep neural networks have allowed automatic speech recognition (ASR) systems to attain human parity on several publicly available clean speech datasets. However, even state-of-the-art ASR systems experience performance degradation when confronted with adverse conditions, as a well-trained acoustic model is sensitive to variations in the speech domain, e.g., background noise. Intuitively, humans address this issue by relying on their linguistic knowledge: the meaning of ambiguous spoken terms is usually inferred from contextual cues, thereby reducing the dependency on the auditory system. Inspired by this observation, we introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction, where N-best decoding hypotheses provide informative elements for true transcription prediction. This approach is a paradigm shift from the traditional language model rescoring strategy, which can only select one candidate hypothesis as the output transcription. The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses and corresponding accurate transcriptions across prevalent speech domains. Given this dataset, we examine three types of LLM-based error correction techniques that use varying amounts of labeled hypotheses-transcription pairs, all of which yield a significant word error rate (WER) reduction. Experimental evidence demonstrates that the proposed technique achieves a breakthrough by surpassing the upper bound of traditional re-ranking-based methods. More surprisingly, an LLM with a reasonable prompt can use its generative capability to correct even those tokens that are missing from the N-best list. We make our results publicly accessible for reproducible pipelines with released pre-trained models, thus providing a new evaluation paradigm for ASR error correction with LLMs.
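
To make the "upper bound of re-ranking" concrete, the following is a minimal sketch, not the benchmark's released code. It computes WER with a word-level edit distance, picks the oracle hypothesis that a perfect re-ranker would select, and assembles a prompt from an N-best list. The function names (`word_errors`, `wer`, `oracle_rerank`, `build_prompt`), the prompt wording, and the toy sentences are all assumptions made for illustration; the "merged" string is written by hand to stand in for a generative correction and is not an output of any model.

```python
from typing import List


def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level Levenshtein distance: substitutions + insertions + deletions."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance divided by reference length."""
    return word_errors(reference, hypothesis) / max(len(reference.split()), 1)


def oracle_rerank(reference: str, n_best: List[str]) -> str:
    """Best a re-ranker could do: select the single closest hypothesis."""
    return min(n_best, key=lambda h: word_errors(reference, h))


def build_prompt(n_best: List[str]) -> str:
    """Hypothetical prompt template; the wording is an assumption."""
    numbered = "\n".join(f"{i}. {h}" for i, h in enumerate(n_best, start=1))
    return ("The following are N-best hypotheses from a speech recognizer.\n"
            f"{numbered}\n"
            "Report the most likely true transcription:")


if __name__ == "__main__":
    reference = "the weather is nice today"
    n_best = [
        "the whether is nice today",
        "the weather is ice today",
        "the wether is nice to day",
    ]
    best = oracle_rerank(reference, n_best)
    print("oracle re-ranking WER:", wer(reference, best))

    # Hand-written stand-in for a generative correction that merges
    # "weather" from hypothesis 2 with "nice" from hypothesis 1.
    merged = "the weather is nice today"
    print("merged-hypothesis WER:", wer(reference, merged))
    print(build_prompt(n_best))
```

In this toy list every hypothesis contains at least one error, so selection alone cannot reach zero WER, while a corrector that combines evidence across hypotheses can. That is the sense in which a generative approach may go below the re-ranking bound.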
