SoftCorrect: Error Correction with Soft Detection for Automatic Speech Recognition

Error correction in automatic speech recognition (ASR) aims to correct the incorrect words in sentences generated by ASR models. Since recent ASR models usually have a low word error rate (WER), error correction models should modify only incorrect words to avoid affecting originally correct tokens, so detecting incorrect words is important for error correction. Previous works on error correction either implicitly detect error words through target-source attention or CTC (connectionist temporal classification) loss, or explicitly locate specific deletion/substitution/insertion errors. However, implicit error detection does not provide a clear signal about which tokens are incorrect, and explicit error detection suffers from low detection accuracy. In this paper, we propose SoftCorrect with a soft error detection mechanism to avoid the limitations of both explicit and implicit error detection. Specifically, we first detect whether a token is correct or not through a probability produced by a specially designed language model, and then design a constrained CTC loss that duplicates only the detected incorrect tokens so that the decoder focuses on correcting error tokens. Compared with implicit error detection with CTC loss, SoftCorrect provides an explicit signal about which words are incorrect and thus needs to duplicate only the incorrect tokens rather than every token; compared with explicit error detection, SoftCorrect does not detect specific deletion/substitution/insertion errors but leaves this to the CTC loss. Experiments on the AISHELL-1 and Aidatatang datasets show that SoftCorrect achieves 26.1% and 9.4% character error rate (CER) reduction respectively, outperforming previous works by a large margin, while retaining the fast speed of parallel generation.
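
The two-step idea in the abstract (soft detection, then duplicating only the detected tokens before CTC-style decoding) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the 0.5 threshold, the choice of three copies, the example sentence, and the example probabilities are all assumptions made for illustration. The per-token probabilities stand in for the output of the detection language model, and the decoder output path is written by hand rather than produced by a trained model. Only the collapse rule (merge repeated symbols, then drop blanks) is standard CTC behavior.

```python
from typing import List, Sequence

BLANK = "<blank>"  # hypothetical blank symbol used by the CTC-style decoder


def expand_detected_errors(
    tokens: Sequence[str],
    p_correct: Sequence[float],
    threshold: float = 0.5,  # assumed detection threshold, not from the paper
    copies: int = 3,         # assumed duplication factor, not from the paper
) -> List[str]:
    """Duplicate only the tokens whose correctness probability is low.

    Tokens judged correct are kept once, so the decoder has no extra
    slots in which to alter them. Tokens judged incorrect are repeated,
    giving the decoder room to substitute, insert, or delete.
    """
    if len(tokens) != len(p_correct):
        raise ValueError("tokens and p_correct must have the same length")
    expanded: List[str] = []
    for token, prob in zip(tokens, p_correct):
        repeat = copies if prob < threshold else 1
        expanded.extend([token] * repeat)
    return expanded


def ctc_collapse(path: Sequence[str], blank: str = BLANK) -> List[str]:
    """Standard CTC collapse: merge adjacent repeats, then drop blanks."""
    output: List[str] = []
    previous = None
    for symbol in path:
        if symbol != previous and symbol != blank:
            output.append(symbol)
        previous = symbol
    return output


# Hypothetical ASR hypothesis with one suspicious token ("sad").
asr_tokens = ["the", "cat", "sad", "down"]
correct_probs = [0.90, 0.95, 0.20, 0.80]  # made-up detector scores

decoder_input = expand_detected_errors(asr_tokens, correct_probs)

# Hand-written decoder output over the expanded slots: the three slots
# for "sad" become one substituted token followed by two blanks.
decoder_path = ["the", "cat", "sat", BLANK, BLANK, "down"]
corrected = ctc_collapse(decoder_path)
```

Because only the flagged position is expanded, the decoder input in this sketch has six slots instead of the twelve that duplicating every token three times would need, which is the contrast the abstract draws with implicit detection through CTC loss alone.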
