Contextual information plays a crucial role in speech recognition, and incorporating it into end-to-end speech recognition models has recently drawn considerable interest. However, previous deep bias methods lack explicit supervision for the bias task. In this study, we introduce a contextual phrase prediction network for an attention-based deep bias method. The network uses contextual embeddings to predict the context phrases that occur in an utterance, and it computes a bias loss that assists the training of the contextualized model. Our method achieves a significant word error rate (WER) reduction across various end-to-end speech recognition models. Experiments on the LibriSpeech corpus show that the proposed model obtains a 12.1% relative WER reduction over the baseline model, and that the WER on the context phrases decreases by 40.5% relative. Moreover, by applying a context phrase filtering strategy, we effectively eliminate the WER degradation that otherwise occurs with a larger biasing list.
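Because the results above are reported as relative WER changes, it may help to make the metric concrete. The following is a minimal sketch, not the evaluation code used in this work: it assumes WER is the word-level edit distance divided by the reference length, and that a relative reduction is the drop in WER divided by the baseline WER. The function names `word_error_rate` and `relative_wer_reduction` are hypothetical, and the numbers in the usage lines are illustrative only, not results from our experiments.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only one previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER that the new system removes."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Illustrative values only (hypothetical, not taken from the experiments).
example_wer = word_error_rate("a b c d", "a x c d")
example_gain = relative_wer_reduction(10.0, 8.0)
```

Under this definition, a relative reduction is independent of whether WER is expressed as a fraction or a percentage, as long as both systems use the same unit.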