Although end-to-end (E2E) trainable automatic speech recognition (ASR) has shown great success by jointly learning acoustic and linguistic information, it still suffers from domain shift, which limits its potential applications. An E2E ASR model implicitly learns an internal language model (LM) that characterises the training distribution of the source domain, and its E2E trainable nature makes this internal LM difficult to adapt to a target domain using text-only data. To solve this problem, this paper proposes decoupled structures for attention-based encoder-decoder (Decoupled-AED) and neural transducer (Decoupled-Transducer) models, which achieve flexible domain adaptation in both offline and online scenarios while maintaining robust intra-domain performance. To this end, the acoustic and linguistic parts of the E2E model decoder (or prediction network) are decoupled, making the linguistic component (i.e. the internal LM) replaceable. When a domain shift is encountered, the internal LM can be directly replaced at inference time by a target-domain LM, without re-training and without domain-specific paired speech-text data. Experiments with E2E ASR models trained on the LibriSpeech-100h corpus showed that the proposed decoupled structure gave 15.1% and 17.2% relative word error rate reductions on the TED-LIUM 2 and AESRC2020 corpora, respectively, while still maintaining performance on intra-domain data.
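The replaceable-LM idea can be illustrated with a toy sketch. This is not the paper's architecture: the class `DecoupledDecoder`, the helper `make_unigram_lm`, the token names and all probabilities are hypothetical, and combining the two parts by adding log-scores is an assumption made here for illustration, since the abstract does not specify how the acoustic and linguistic outputs are fused.

```python
import math
from typing import Callable, Dict, List

# Hypothetical interface: an LM maps a token history to log-probabilities.
LanguageModel = Callable[[List[str]], Dict[str, float]]


def make_unigram_lm(probs: Dict[str, float]) -> LanguageModel:
    """Toy stand-in for a real LM: ignores the history, returns fixed log-probs."""
    log_probs = {tok: math.log(p) for tok, p in probs.items()}
    return lambda history: log_probs


class DecoupledDecoder:
    """Toy decoder whose linguistic part (internal LM) is a swappable component."""

    def __init__(self, internal_lm: LanguageModel) -> None:
        self.internal_lm = internal_lm

    def replace_lm(self, new_lm: LanguageModel) -> None:
        # Swap the linguistic component at inference time; nothing is re-trained.
        self.internal_lm = new_lm

    def step(self, acoustic_scores: Dict[str, float], history: List[str]) -> str:
        # Assumption: the two parts are fused by summing log-scores per token.
        lm_scores = self.internal_lm(history)
        combined = {tok: acoustic_scores[tok] + lm_scores[tok] for tok in acoustic_scores}
        return max(combined, key=combined.get)


# Acoustically ambiguous frame: the two homophones score almost equally.
acoustic = {"cell": math.log(0.45), "sell": math.log(0.55)}

source_lm = make_unigram_lm({"cell": 0.3, "sell": 0.7})  # made-up source-domain LM
target_lm = make_unigram_lm({"cell": 0.8, "sell": 0.2})  # made-up target-domain LM

decoder = DecoupledDecoder(source_lm)
source_choice = decoder.step(acoustic, history=[])

decoder.replace_lm(target_lm)  # domain shift: replace only the LM
target_choice = decoder.step(acoustic, history=[])

print(source_choice, target_choice)
```

With the acoustic scores held fixed, swapping the LM alone changes which token is chosen, which is the behaviour the decoupled structure is meant to enable using text-only target-domain data.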