Contextual information plays a crucial role in speech recognition, and incorporating it into end-to-end speech recognition models has drawn considerable interest recently. However, previous deep bias methods lack explicit supervision for the bias task. In this study, we introduce a contextual phrase prediction network for an attention-based deep bias method. The network uses contextual embeddings to predict the context phrases that occur in an utterance and computes a bias loss that assists the training of the contextualized model. Our method achieves significant word error rate (WER) reductions across various end-to-end speech recognition models. Experiments on the LibriSpeech corpus show that the proposed model obtains a 12.1% relative WER improvement over the baseline model, and that the WER of the context phrases decreases by a relative 40.5%. Moreover, by applying a context phrase filtering strategy, we effectively eliminate the WER degradation that occurs when a larger biasing list is used.
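The relative figures quoted above are relative rather than absolute reductions. As a minimal sketch of that arithmetic, assuming the usual definition (baseline WER minus system WER, divided by baseline WER), the helper below shows the computation. The function name `relative_wer_reduction` and the input values are hypothetical and chosen only for illustration; they are not results from this work.

```python
def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Return the relative WER reduction as a percentage of the baseline WER.

    Both arguments are WERs in the same unit (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


# Hypothetical values for illustration only (not taken from the paper):
# a baseline WER of 10.0% and a system WER of 8.0%.
example = relative_wer_reduction(10.0, 8.0)
print(f"relative WER reduction: {example:.1f}%")
```

Under this definition, an absolute drop of 2 points from a 10% baseline is a 20% relative reduction, which is why relative figures can look much larger than the absolute change in WER.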