Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition

Language models (LMs) have long been used to improve the results of automatic speech recognition (ASR) systems, but they are unaware of the errors that ASR systems make. Error correction models are designed to fix ASR errors, but they have shown little improvement over traditional LMs, mainly due to the lack of supervised training data. In this paper, we present Denoising LM (DLM), a $\textit{scaled}$ error correction model trained with vast amounts of synthetic data, which significantly exceeds prior attempts while achieving new state-of-the-art ASR performance. We use text-to-speech (TTS) systems to synthesize audio, which is fed into an ASR system to produce noisy hypotheses; these hypotheses are then paired with the original texts to train the DLM. DLM has several $\textit{key ingredients}$: (i) up-scaled model and data; (ii) use of multi-speaker TTS systems; (iii) a combination of multiple noise augmentation strategies; and (iv) new decoding techniques. With a Transformer-CTC ASR, DLM achieves 1.5% word error rate (WER) on $\textit{test-clean}$ and 3.3% WER on $\textit{test-other}$ on Librispeech, which to our knowledge are the best reported numbers in the setting where no external audio data are used, and which even match self-supervised methods that use external audio data. Furthermore, a single DLM is applicable to different ASRs and greatly surpasses the performance of conventional LM-based beam-search rescoring. These results indicate that properly investigated error correction models have the potential to replace conventional LMs, holding the key to a new level of accuracy in ASR systems.
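The data-generation loop described in the abstract (text, then TTS audio, then ASR hypothesis, paired with the original text) can be sketched as follows. This is a minimal illustration only, not the authors' implementation: `toy_tts`, `toy_asr`, `build_dlm_pairs`, the speaker names, and the fake "audio" type are all hypothetical stand-ins for real TTS and ASR systems, and `wer` is the standard word-level edit-distance definition of word error rate.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical stand-in for a waveform: (speaker_id, text). A real system
# would produce audio samples here.
Audio = Tuple[str, str]


def toy_tts(text: str, speaker: str) -> Audio:
    """Placeholder for a multi-speaker TTS system (assumption, not a real API)."""
    return (speaker, text)


def toy_asr(audio: Audio) -> str:
    """Placeholder for an ASR system that makes deterministic, speaker-dependent errors."""
    speaker, text = audio
    words = text.split()
    if speaker == "spk_b" and words:
        words = words[:-1]  # simulate a deletion error
    else:
        words = ["two" if w == "to" else w for w in words]  # simulate a substitution
    return " ".join(words)


def build_dlm_pairs(
    texts: Iterable[str],
    speakers: Iterable[str],
    tts: Callable[[str, str], Audio],
    asr: Callable[[Audio], str],
) -> List[Tuple[str, str]]:
    """Return (noisy_hypothesis, clean_text) pairs for training an error corrector."""
    speakers = list(speakers)
    pairs = []
    for text in texts:
        for speaker in speakers:
            hypothesis = asr(tts(text, speaker))  # text -> audio -> noisy text
            pairs.append((hypothesis, text))
    return pairs


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    for noisy, clean in build_dlm_pairs(["i want to go home"],
                                        ["spk_a", "spk_b"], toy_tts, toy_asr):
        print(f"{noisy!r} -> {clean!r}  (WER of input: {wer(clean, noisy):.2f})")
```

In this sketch, using several speakers for the same text yields several distinct noisy hypotheses per clean sentence, which is one plausible reading of why multi-speaker TTS is listed among the key ingredients.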
