Contextual information plays a crucial role in speech recognition, and incorporating it into end-to-end speech recognition models has recently drawn considerable interest. However, previous deep bias methods lack explicit supervision for the bias task. In this study, we introduce a contextual phrase prediction network for an attention-based deep bias method. This network uses contextual embeddings to predict the context phrases that occur in an utterance and computes a bias loss that assists the training of the contextualized model. Our method achieves significant word error rate (WER) reductions across various end-to-end speech recognition models. Experiments on the LibriSpeech corpus show that the proposed model obtains a 12.1% relative WER improvement over the baseline model, and that the WER on the context phrases decreases by a relative 40.5%. Moreover, by applying a context phrase filtering strategy, we effectively eliminate the WER degradation that occurs when a larger biasing list is used.
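The reported gains are relative WER reductions, i.e. the drop in WER divided by the baseline WER. The sketch below shows how such a figure could be computed from reference and hypothesis transcripts. It is a minimal illustration, not the authors' evaluation code: the function names (`word_errors`, `wer`, `relative_reduction`) and the toy transcripts are assumptions made here, and no text normalization is applied.

```python
from typing import Iterable, Tuple


def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level edit distance (substitutions + deletions + insertions)."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Corpus-level WER: total word errors over total reference words."""
    pairs = list(pairs)
    errors = sum(word_errors(ref, hyp) for ref, hyp in pairs)
    words = sum(len(ref.split()) for ref, _ in pairs)
    return errors / words


def relative_reduction(baseline_wer: float, proposed_wer: float) -> float:
    """Relative WER reduction of the proposed system over the baseline."""
    return (baseline_wer - proposed_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical toy transcripts, not data from the paper.
    references = ["call doctor strangelove now", "play the new album"]
    baseline_hyps = ["call doctor strange love now", "play a new album"]
    biased_hyps = ["call doctor strangelove now", "play a new album"]

    base = wer(zip(references, baseline_hyps))
    ours = wer(zip(references, biased_hyps))
    print(f"baseline WER={base:.3f}, biased WER={ours:.3f}, "
          f"relative reduction={relative_reduction(base, ours):.1%}")
```

Restricting the reference and hypothesis words to those belonging to context phrases would give a phrase-level WER in the same way, although the exact scoring protocol used in the paper is not specified in this abstract.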