Attention-based encoder-decoder (AED) speech recognition models have been widely successful in recent years. However, jointly optimizing the acoustic model and language model in an end-to-end manner makes text adaptation challenging. In particular, performing text adaptation effectively, quickly, and inexpensively has become a primary concern for deploying AED systems in industry. To address this issue, we propose a novel model, the hybrid attention-based encoder-decoder (HAED) speech recognition model, which preserves the modularity of conventional hybrid automatic speech recognition systems. Our HAED model separates the acoustic and language models, allowing the use of conventional text-based language model adaptation techniques. We demonstrate that the proposed HAED model yields a 21\% relative word error rate (WER) improvement when out-of-domain text data is used for language model adaptation, with only a minor degradation in WER on a general test set compared with a conventional AED model.