Recent studies have demonstrated the efficacy of large language models (LLMs) in error correction for automatic speech recognition (ASR). However, much of this research focuses on English. This paper redirects attention to Chinese. First, we construct a specialized benchmark dataset for Chinese ASR error correction, the Chinese Hypotheses Paradise dataset (ChineseHP), comprising 724K hypothesis-transcription pairs; it covers a wide range of scenarios and presents significant challenges. We then conduct a preliminary evaluation on this dataset with both direct prompting and fine-tuning of pre-trained LLMs. Furthermore, we propose a straightforward Pinyin regularization method for prompts, which transcribes text hypotheses directly into Pinyin. Experimental results show that Pinyin regularization consistently enhances the error-correcting ability of LLMs compared with prompts without regularization. The dataset is available on the website.
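A minimal sketch of how Pinyin regularization of a prompt might look, assuming the hypothesis and its Pinyin transcription are simply concatenated before being sent to the LLM. The tiny character-to-Pinyin table and the prompt template here are illustrative stand-ins, not the paper's exact method; a real system would use a full Pinyin library (e.g. pypinyin) for the transcription:

```python
# Toy character-to-Pinyin table for illustration only; a real system
# would use a complete Pinyin library (e.g. pypinyin) instead.
PINYIN_OF = {
    "我": "wo3", "们": "men2", "在": "zai4",
    "市": "shi4", "试": "shi4", "场": "chang3",
}

def pinyin_regularize(hypothesis: str) -> str:
    """Transcribe a text hypothesis into space-separated Pinyin,
    passing through any character not in the table."""
    return " ".join(PINYIN_OF.get(ch, ch) for ch in hypothesis)

def build_prompt(hypothesis: str) -> str:
    """Augment an ASR hypothesis with its Pinyin before asking an
    LLM to correct it (hypothetical template for illustration)."""
    return (
        "ASR hypothesis: " + hypothesis + "\n"
        "Pinyin: " + pinyin_regularize(hypothesis) + "\n"
        "Corrected transcription:"
    )

# Homophones like 试/市 (both "shi4") share Pinyin, which is the cue
# the regularization exposes to the LLM.
print(build_prompt("我们在试场"))
```

Because many ASR errors in Chinese are homophone substitutions, the Pinyin line makes the acoustic evidence explicit to the LLM even when the written hypothesis contains the wrong character.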