Accurate recognition of rare and new words remains a pressing problem for contextualized Automatic Speech Recognition (ASR) systems. Most context-biasing methods involve modifying the ASR model or the beam-search decoding algorithm, which complicates model reuse and slows down inference. This work presents a new approach to fast context-biasing with a CTC-based Word Spotter (CTC-WS) for CTC and Transducer (RNN-T) ASR models. The proposed method matches CTC log-probabilities against a compact context graph to detect potential context-biasing candidates. The valid candidates then replace their greedy recognition counterparts in the corresponding frame intervals. A Hybrid Transducer-CTC model makes CTC-WS applicable to the Transducer model. The results demonstrate a significant acceleration of context-biased recognition together with improved F-score and WER compared to baseline methods. The proposed method is publicly available in the NVIDIA NeMo toolkit.
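
To make the replacement step more concrete, the following is a minimal sketch of how spotted candidates might replace greedy words in their frame intervals. It is not the NeMo implementation: the `Word` type, the `merge_candidates` helper, the overlap rule, and the use of a single score per candidate are all assumptions made for illustration, and the word-spotting pass over the context graph is taken as already done.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """A word with its frame interval (hypothetical type for illustration)."""
    text: str
    start: int          # first frame, inclusive
    end: int            # last frame, inclusive
    score: float = 0.0  # assumed candidate score, higher is better


def overlaps(a: Word, b: Word) -> bool:
    """True if the two inclusive frame intervals share at least one frame."""
    return a.start <= b.end and b.start <= a.end


def merge_candidates(greedy: List[Word], candidates: List[Word]) -> List[Word]:
    """Replace greedy words by spotted candidates in overlapping intervals.

    Assumption: when candidates overlap each other, the higher-scoring
    one wins and the others are discarded.
    """
    accepted: List[Word] = []
    for cand in sorted(candidates, key=lambda w: w.score, reverse=True):
        if not any(overlaps(cand, a) for a in accepted):
            accepted.append(cand)
    # Drop every greedy word whose interval overlaps an accepted candidate.
    kept = [w for w in greedy if not any(overlaps(w, a) for a in accepted)]
    return sorted(kept + accepted, key=lambda w: w.start)


# Made-up example: the greedy decoder split a context phrase into two words.
greedy = [Word("call", 0, 4), Word("in", 5, 7),
          Word("video", 8, 14), Word("now", 15, 18)]
candidates = [Word("nvidia", 5, 14, score=-1.2),
              Word("video", 8, 14, score=-3.0)]
merged = merge_candidates(greedy, candidates)
print(" ".join(w.text for w in merged))  # prints "call nvidia now"
```

Because the merge operates only on the decoder output and the candidate list, a sketch like this leaves both the acoustic model and the decoding algorithm untouched, which is the property the abstract highlights.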