Error correction (EC) based on large language models is an emerging technology for enhancing the performance of automatic speech recognition (ASR) systems. Generally, training data for EC are collected by automatically pairing a large set of ASR hypotheses (as sources) with their gold references (as targets). However, the quality of such pairs is not guaranteed, and we observed various types of noise that can make EC models brittle, e.g., by inducing overcorrection in out-of-domain (OOD) settings. In this work, we propose two fundamental criteria that EC training data should satisfy: EC targets should (1) improve linguistic acceptability over their sources and (2) be inferable from the available context (e.g., the source phonemes). Using these criteria, we identify low-quality EC pairs and train the models not to make any correction in such cases, a process we refer to as conservative data filtering. In our experiments, we focus on Japanese ASR, using a strong Conformer-CTC model as the baseline and finetuning Japanese LLMs for EC. Through our evaluation on a suite of 21 internal benchmarks, we demonstrate that our approach can significantly reduce overcorrection and improve both the accuracy and quality of ASR results in challenging OOD settings.
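To make the filtering procedure concrete, the sketch below illustrates one possible reading of conservative data filtering: pairs that fail either criterion are not discarded but have their targets replaced with the sources, so the EC model is trained to leave such cases untouched. The two scoring callables, the phoneme-similarity proxy, and the threshold value are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, List


@dataclass
class ECPair:
    source: str           # ASR hypothesis
    target: str           # gold reference
    source_phonemes: str  # phoneme sequence of the hypothesis


def conservative_filter(
    pairs: List[ECPair],
    acceptability: Callable[[str], float],  # assumed: higher = more acceptable (e.g. an LM score)
    phonemize: Callable[[str], str],        # assumed: text -> phoneme string
    inferability_threshold: float = 0.8,    # assumed threshold, not from the paper
) -> List[ECPair]:
    """Replace targets of low-quality EC pairs with their sources (no correction)."""
    filtered = []
    for p in pairs:
        # Criterion 1: the target should improve linguistic acceptability over the source.
        improves = acceptability(p.target) > acceptability(p.source)

        # Criterion 2: the target should be inferable from the available context,
        # approximated here by phoneme-level similarity to the source phonemes.
        similarity = SequenceMatcher(
            None, p.source_phonemes, phonemize(p.target)
        ).ratio()
        inferable = similarity >= inferability_threshold

        if improves and inferable:
            filtered.append(p)
        else:
            # Conservative fallback: train the EC model to make no correction here.
            filtered.append(ECPair(p.source, p.source, p.source_phonemes))
    return filtered
```

In this reading, keeping the degenerate (source, source) pairs rather than dropping them is what teaches the model when *not* to correct, which is the behavior the abstract ties to reduced overcorrection in OOD settings.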