Deep biasing for the Transducer can improve the recognition of rare words or contextual entities, which is essential in practical applications, especially for streaming Automatic Speech Recognition (ASR). However, deep biasing with large-scale rare-word lists remains challenging: performance drops significantly as the number of distractors grows and when the bias list contains words with similar grapheme sequences. In this paper, we combine the phoneme and textual information of rare words in Transducers to distinguish words with similar pronunciation or spelling. Moreover, training with text-only data that contains more rare words benefits large-scale deep biasing. Experiments on the LibriSpeech corpus demonstrate that the proposed method achieves state-of-the-art rare word error rate across different scales and levels of bias lists.