Neural contextual biasing effectively improves automatic speech recognition (ASR) for crucial phrases within a speaker's context, particularly those that are infrequent in the training data. This work proposes contextual text injection (CTI) to enhance contextual ASR. CTI leverages not only paired speech-text data but also a much larger corpus of unpaired text to optimize the ASR model and its biasing component. Unpaired text is converted into speech-like representations and used to guide the model's attention towards relevant bias phrases. Moreover, we introduce CTI minimum word error rate (CTI-MWER) training, which minimizes the expected WER caused by contextual biasing when unpaired text is injected into the model. Experiments show that CTI with 100 billion text sentences can achieve up to 43.3% relative WER reduction from a strong neural biasing model. CTI-MWER provides a further relative improvement of 23.5%.
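To make the text-injection idea concrete, the following is a minimal sketch, not the paper's actual implementation: unpaired text is mapped to speech-like frame representations, which then attend over encoded bias phrases via cross-attention. All module names, shapes, and the fixed upsampling factor are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch of contextual text injection (illustrative only; all names,
# dimensions, and the fixed upsampling factor are assumptions, not the paper's design).
import torch
import torch.nn as nn


class TextToSpeechLikeEncoder(nn.Module):
    """Maps token IDs to frame-level, speech-like representations (hypothetical)."""

    def __init__(self, vocab_size: int, dim: int, upsample: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.upsample = upsample  # crude stand-in for a learned duration/upsampling model

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        emb = self.embed(token_ids)                      # (B, T, D)
        return emb.repeat_interleave(self.upsample, 1)   # (B, T * upsample, D)


class ContextualBiaser(nn.Module):
    """Cross-attention from (speech or speech-like) frames to bias-phrase embeddings."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frames: torch.Tensor, bias_phrases: torch.Tensor) -> torch.Tensor:
        biased, _ = self.attn(query=frames, key=bias_phrases, value=bias_phrases)
        return frames + biased                           # residual biasing of the frames


# Usage: inject one unpaired text sentence alongside a small set of bias phrases.
vocab, dim = 1000, 64
text_encoder = TextToSpeechLikeEncoder(vocab, dim)
biaser = ContextualBiaser(dim)

unpaired_text = torch.randint(0, vocab, (1, 12))         # token IDs of an unpaired sentence
bias_phrase_embs = torch.randn(1, 5, dim)                # 5 encoded bias phrases (placeholder)

speech_like = text_encoder(unpaired_text)
biased_frames = biaser(speech_like, bias_phrase_embs)
print(biased_frames.shape)                               # torch.Size([1, 48, 64])
```

In this sketch, the same biasing module could in principle consume either real speech encodings (paired data) or the speech-like representations derived from unpaired text, which is the property the CTI training objective relies on.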