Error correction techniques have been used to refine the output sentences of automatic speech recognition (ASR) models and achieve a lower word error rate (WER). Previous works usually adopt end-to-end models and depend heavily on Pseudo Paired Data and Original Paired Data. However, when pre-trained only on Pseudo Paired Data, previous models have a negative effect on correction. Fine-tuning on Original Paired Data, in turn, requires the source-side data to be transcribed by a well-trained ASR model, which is time-consuming and not universal. In this paper, we propose UCorrect, an unsupervised Detector-Generator-Selector framework for ASR error correction. UCorrect does not depend on the training data mentioned above. The procedure is to first detect whether a character is erroneous, then generate candidate characters, and finally select the most confident candidate to replace the erroneous character. Experiments on the public AISHELL-1 and WenetSpeech datasets show the effectiveness of UCorrect for ASR error correction: 1) it achieves a significant WER reduction, 6.83\% even without fine-tuning and 14.29\% after fine-tuning; 2) it outperforms popular non-autoregressive (NAR) correction models by a large margin with competitively low latency; and 3) it is a universal method, as it reduces the WER of the ASR model under different decoding strategies and of ASR models trained on datasets of different scales.
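The detect, generate, select procedure described above can be illustrated with a minimal Python sketch. This is not the UCorrect implementation: the names `make_bigram_scorer` and `correct`, the thresholds, and the toy character-bigram scorer are all hypothetical stand-ins chosen for illustration. In practice the three stages would presumably rely on stronger models than the single shared scorer used here.

```python
from typing import Callable, Dict, List

# Hypothetical scorer type: confidence that `char` belongs at position i of `chars`.
Scorer = Callable[[List[str], int, str], float]


def make_bigram_scorer(corpus: List[str]) -> Scorer:
    """Toy stand-in for a language model: P(char | previous char) from bigram counts."""
    counts: Dict[str, Dict[str, int]] = {}
    for sentence in corpus:
        for left, right in zip(sentence, sentence[1:]):
            following = counts.setdefault(left, {})
            following[right] = following.get(right, 0) + 1

    def score(chars: List[str], i: int, char: str) -> float:
        if i == 0:
            return 1.0  # no left context in this toy model, so trust the character
        following = counts.get(chars[i - 1], {})
        total = sum(following.values())
        return following.get(char, 0) / total if total else 0.0

    return score


def correct(chars: List[str], score: Scorer, vocab: List[str],
            detect_threshold: float = 0.1, top_k: int = 3) -> List[str]:
    """Assumed detect -> generate -> select loop over characters."""
    out = list(chars)
    for i, ch in enumerate(chars):
        # Detector: a character is flagged as erroneous if its confidence is low.
        if score(out, i, ch) >= detect_threshold:
            continue
        # Generator: propose the top-k most confident candidate characters.
        candidates = sorted(vocab, key=lambda c: score(out, i, c), reverse=True)[:top_k]
        # Selector: keep the most confident candidate, and only if it beats the original.
        best = max(candidates, key=lambda c: score(out, i, c))
        if score(out, i, best) > score(out, i, ch):
            out[i] = best
    return out


corpus = ["speech", "speech", "speed"]          # toy "training" text, illustration only
vocab = sorted({c for s in corpus for c in s})  # candidate character inventory
scorer = make_bigram_scorer(corpus)
print("".join(correct(list("spaech"), scorer, vocab)))  # speech
```

The sketch needs no paired ASR transcripts, only unpaired text to build the scorer, which mirrors the unsupervised setting in spirit. Sharing one scorer across all three stages is a simplification made here for brevity.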