Contextual information plays a crucial role in speech recognition, and incorporating it into end-to-end speech recognition models has recently drawn considerable interest. However, previous deep bias methods lack explicit supervision for the bias task. In this study, we introduce a contextual phrase prediction network for an attention-based deep bias method. This network uses contextual embeddings to predict the context phrases that occur in an utterance, and computes a bias loss that assists the training of the contextualized model. Our method achieves a significant word error rate (WER) reduction across various end-to-end speech recognition models. Experiments on the LibriSpeech corpus show that our proposed model obtains a 12.1% relative WER improvement over the baseline model, and that the WER on the context phrases decreases by a relative 40.5%. Moreover, by applying a context phrase filtering strategy, we also effectively eliminate the WER degradation that occurs when a larger biasing list is used.
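For readers unfamiliar with the metrics quoted above, the following is a minimal sketch of how WER and a relative WER reduction are conventionally computed. It is not the evaluation code used in this work. The function names `word_error_rate` and `relative_wer_reduction` and the example sentences are assumptions introduced purely for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER removed by the new system."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical example: one substituted word out of four reference words.
wer = word_error_rate("call doctor smith now", "call doctor smyth now")
```

Under this definition, a relative reduction of 0.121 means the new WER equals 87.9% of the baseline WER, whatever the absolute baseline happens to be.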