High-quality human transcription is essential for training and improving Automatic Speech Recognition (ASR) models. A recent study~\cite{libricrowd} found that every 1% increase in the Word Error Rate (WER) of training transcriptions raises the WER of the resulting ASR model by approximately 2%. Transcription errors are inevitable, even for highly trained annotators, yet few studies have explored human transcription correction. Error correction methods designed for other problems, such as ASR error correction and grammatical error correction, do not perform well enough on this problem. We therefore propose HTEC (Human Transcription Error Correction). HTEC consists of two stages: Trans-Checker, an error detection model that predicts and masks erroneous words, and Trans-Filler, a sequence-to-sequence generative model that fills the masked positions. We propose a holistic list of correction operations, including four novel operations that handle deletion errors. We further propose a variant of embeddings that incorporates phoneme information into the transformer input. HTEC outperforms other methods by a large margin and surpasses human annotators by 2.2% to 4.5% in WER. Finally, we deployed HTEC to assist human annotators and showed that it is particularly effective as a co-pilot, improving transcription quality by 15.1% without sacrificing transcription velocity.
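Since every result above is reported in WER, it may help to recall how the metric is conventionally computed: the word-level edit distance between a reference and a hypothesis, divided by the number of reference words. The following is a minimal sketch under stated assumptions, not the evaluation code used in this work. It assumes whitespace tokenization and no text normalization, and the function name `word_error_rate` is illustrative only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words.

    Assumption: words are separated by whitespace and compared exactly,
    with no casing or punctuation normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

In this toy example the transcriber drops one word, which is the kind of deletion error the correction operations above are meant to handle.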