End-to-end (E2E) automatic speech recognition (ASR) implicitly learns the token sequence distribution of its paired audio-transcript training data. However, it still suffers from domain shift between training and testing, and domain adaptation remains challenging. To alleviate this problem, this paper designs a replaceable internal language model (RILM) method, which makes it feasible to directly replace the internal language model (LM) of an E2E ASR model with a target-domain LM at the decoding stage when a domain shift is encountered. Furthermore, this paper proposes a residual softmax (R-softmax), designed for CTC-based E2E ASR models, that adapts the model to the target domain during inference without re-training. For E2E ASR models trained on the LibriSpeech corpus, experiments show that the proposed methods give a 2.6% absolute word error rate (WER) reduction on the Switchboard data and a 1.0% WER reduction on the AESRC2020 corpus while maintaining intra-domain ASR results.