Contextual information plays a crucial role in speech recognition, and incorporating it into end-to-end speech recognition models has drawn considerable interest recently. However, previous deep bias methods lack explicit supervision for the bias task. In this study, we introduce a contextual phrase prediction network for an attention-based deep bias method. This network uses contextual embeddings to predict the context phrases that occur in an utterance and computes a bias loss that assists the training of the contextualized model. Our method achieves a significant word error rate (WER) reduction across various end-to-end speech recognition models. Experiments on the LibriSpeech corpus show that our proposed model obtains a 12.1% relative WER improvement over the baseline model, and that the WER on the context phrases decreases by 40.5% relative. Moreover, by applying a context phrase filtering strategy, we effectively eliminate the WER degradation that otherwise occurs when using a larger biasing list.
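For readers less familiar with the metrics quoted above, the following is a minimal sketch of how WER and a relative WER reduction are conventionally computed. It is not the evaluation code used in this work: the function names `word_error_rate` and `relative_wer_reduction` are hypothetical, the tokenization is a plain whitespace split, and the absolute WER values in the usage note are illustrative only, since the abstract reports only relative figures.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace and compared exactly.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction, as a percentage of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer
```

Under this definition, a hypothetical baseline WER of 10.0% that drops to 8.79% corresponds to `relative_wer_reduction(10.0, 8.79)`, which is approximately 12.1, that is, a 12.1% relative improvement.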