Recent advances in deep learning and automatic speech recognition (ASR) have enabled end-to-end (E2E) ASR systems and raised recognition accuracy to a new level. E2E systems implicitly model all conventional ASR components, such as the acoustic model (AM) and the language model (LM), in a single network trained on audio-text pairs. Despite this simpler architecture, fusing a separate LM, trained exclusively on text corpora, into the E2E system has proven beneficial. LM fusion has drawbacks, however, such as its inability to address the domain mismatch inherent to the internal AM. Inspired by LM fusion, we propose integrating an external AM into the E2E system to better address this domain mismatch. This approach yields a significant reduction in word error rate, of up to 14.3% across varied test sets. We also find that AM fusion is particularly beneficial for named entity recognition.
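The abstract does not say how the external AM is combined with the E2E model, so the following is only a minimal sketch of one plausible reading: a log-linear, shallow-fusion-style combination applied when rescoring an n-best list. The names `Hypothesis`, `fused_score`, and `rescore`, the interpolation weights, and all scores are hypothetical and chosen purely for illustration; this should not be read as the authors' method.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    e2e_logp: float     # log-probability assigned by the E2E model
    ext_lm_logp: float  # log-probability from an external, text-only LM
    ext_am_logp: float  # score from an external AM (assumed to be log-domain)


def fused_score(h: Hypothesis, lm_weight: float = 0.3, am_weight: float = 0.3) -> float:
    """Log-linear combination of the three scores (an assumed fusion rule)."""
    return h.e2e_logp + lm_weight * h.ext_lm_logp + am_weight * h.ext_am_logp


def rescore(nbest: list[Hypothesis], lm_weight: float = 0.3, am_weight: float = 0.3) -> Hypothesis:
    """Return the hypothesis with the highest fused score."""
    return max(nbest, key=lambda h: fused_score(h, lm_weight, am_weight))


# Made-up scores: the E2E model and the LM slightly prefer the common spelling,
# while the hypothetical external AM prefers the other hypothesis.
nbest = [
    Hypothesis("call john smith", e2e_logp=-4.0, ext_lm_logp=-6.0, ext_am_logp=-9.0),
    Hypothesis("call jon smyth", e2e_logp=-4.2, ext_lm_logp=-6.5, ext_am_logp=-5.0),
]

lm_only = rescore(nbest, am_weight=0.0)  # LM fusion alone
with_am = rescore(nbest)                 # LM fusion plus the external AM term
```

With these illustrative numbers, LM fusion alone keeps the first hypothesis, while adding the external AM term flips the choice to the second. This mirrors the abstract's point that an external LM cannot correct an acoustic mismatch, although a real system would tune the weights on held-out data and might fuse scores during beam search rather than at rescoring time.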