It is difficult for an end-to-end (E2E) automatic speech recognition (ASR) system to recognize words, such as entities, that appear infrequently in the training data. A widely used method to mitigate this issue is to feed contextual information into the acoustic model. Previous work has shown that a compact and accurate contextual list can boost performance significantly. In this paper, we propose an efficient approach to obtain a high-quality contextual list for a unified streaming/non-streaming E2E model. Specifically, we first use the phone-level streaming output to filter the predefined contextual word list, and then fuse the filtered list into the non-causal encoder and decoder to generate the final recognition results. Our approach improves the accuracy of the contextual ASR system and speeds up the inference process. Experiments on two datasets demonstrate over 20% character error rate reduction (CERR) compared with the baseline system. Meanwhile, the real-time factor (RTF) of our system remains within 0.15 when the size of the contextual word list grows beyond 6000.
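To make the filtering step concrete, the following is a minimal sketch of how a phone-level streaming hypothesis could prune a predefined contextual word list before the second pass. It is not the scoring used in this work: as a stand-in, it keeps a word when its pronunciation approximately matches some contiguous span of the phone hypothesis, measured by substring edit distance. The function names, the toy lexicon, the phone symbols, and the `max_error_ratio` threshold are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence


def min_substring_edit_distance(pattern: Sequence[str], text: Sequence[str]) -> int:
    """Smallest edit distance between `pattern` and any contiguous span of `text`."""
    # prev[j]: best cost of aligning the pattern prefix seen so far with a span
    # of `text` that ends at position j. The first row is all zeros so that a
    # match may start anywhere in `text` at no cost.
    prev = [0] * (len(text) + 1)
    for i, p in enumerate(pattern, start=1):
        curr = [i] + [0] * len(text)
        for j, t in enumerate(text, start=1):
            curr[j] = min(
                prev[j] + 1,              # pattern phone missing from the text
                curr[j - 1] + 1,          # extra phone in the text
                prev[j - 1] + (p != t),   # match (cost 0) or substitution (cost 1)
            )
        prev = curr
    # The match may also end anywhere in `text`.
    return min(prev)


def filter_context_list(
    phone_hyp: Sequence[str],
    lexicon: Dict[str, List[str]],
    max_error_ratio: float = 0.34,  # hypothetical threshold, not from the paper
) -> List[str]:
    """Keep words whose phone sequence roughly occurs in the phone hypothesis."""
    kept = []
    for word, phones in lexicon.items():
        if not phones:
            continue
        dist = min_substring_edit_distance(phones, phone_hyp)
        if dist / len(phones) <= max_error_ratio:
            kept.append(word)
    return kept


# Toy example: the first-pass phone output contains one error ("a" for "an").
phone_hyp = ["n", "i", "h", "ao", "zh", "ang", "s", "a"]
lexicon = {
    "Zhang San": ["zh", "ang", "s", "an"],
    "Li Si": ["l", "i", "s", "i"],
    "Wang Wu": ["w", "ang", "w", "u"],
}
print(filter_context_list(phone_hyp, lexicon))  # ['Zhang San']
```

Under these assumptions, only the surviving words would be passed on as bias context to the non-causal encoder and decoder, so the second-pass cost depends on the size of the filtered list rather than on the full contextual word list.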