Robust Acoustic and Semantic Contextual Biasing in Neural Transducers for Speech Recognition

Attention-based contextual biasing approaches have shown significant improvements in the recognition of generic and/or personal rare words in End-to-End Automatic Speech Recognition (E2E ASR) systems such as neural transducers. These approaches use cross-attention to bias the model towards specific contextual entities that are injected into the model as bias phrases. Prior approaches typically relied on subword encoders to encode the bias phrases. However, subword tokenizations are coarse and fail to capture the granular pronunciation information that is crucial for biasing based on acoustic similarity. In this work, we propose using lightweight character representations to encode fine-grained pronunciation features, improving contextual biasing guided by acoustic similarity between the audio and the contextual entities (termed acoustic biasing). We further integrate pretrained neural language model (NLM) based encoders to encode the utterance's semantic context along with the contextual entities, to perform biasing informed by the utterance's semantic context (termed semantic biasing). Experiments using a Conformer Transducer model on the Librispeech dataset show a 4.62% to 9.26% relative word error rate (WER) improvement across different biasing list sizes over the baseline contextual model when incorporating our proposed acoustic and semantic biasing approach. On a large-scale in-house dataset, we observe a 7.91% relative WER improvement compared to our baseline model. On tail utterances, the improvements are even more pronounced, with 36.80% and 23.40% relative WER improvements on Librispeech rare words and an in-house test set, respectively.
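The cross-attention step described above can be illustrated with a minimal sketch. It is not the paper's architecture: the mean-pooled character table stands in for a learned character encoder, the residual addition stands in for whatever combination the model uses, and the names `char_phrase_embedding` and `cross_attention_bias`, the dimension, the random inputs, and the example phrases are all assumptions made for illustration.

```python
import numpy as np


def char_phrase_embedding(phrase, char_table):
    """Mean-pool per-character vectors for one bias phrase.

    Assumption: a trained system would use a learned character encoder;
    mean pooling is only a placeholder.
    """
    vecs = [char_table[c] for c in phrase.lower() if c in char_table]
    if not vecs:
        raise ValueError(f"no known characters in phrase: {phrase!r}")
    return np.mean(vecs, axis=0)


def cross_attention_bias(audio_frames, phrase_embs):
    """Scaled dot-product cross-attention.

    Audio frames act as queries; bias-phrase embeddings act as keys and values.
    Returns the biased frames and the attention weights.
    """
    d = audio_frames.shape[-1]
    scores = audio_frames @ phrase_embs.T / np.sqrt(d)   # (T, N)
    scores -= scores.max(axis=-1, keepdims=True)         # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over phrases
    context = weights @ phrase_embs                      # (T, d)
    # Assumption: the biasing context is added residually to each frame.
    return audio_frames + context, weights


rng = np.random.default_rng(0)
dim = 8  # hypothetical embedding size
alphabet = "abcdefghijklmnopqrstuvwxyz '"
char_table = {c: rng.normal(size=dim) for c in alphabet}

bias_phrases = ["conformer", "librispeech"]  # hypothetical bias list
phrase_embs = np.stack([char_phrase_embedding(p, char_table) for p in bias_phrases])

audio_frames = rng.normal(size=(5, dim))  # stand-in for audio encoder outputs
biased, weights = cross_attention_bias(audio_frames, phrase_embs)
print(biased.shape, weights.shape)
```

Each row of `weights` is a distribution over the bias phrases for one audio frame. In a trained model these weights would reflect acoustic similarity between the frame and each phrase; here they are driven by random vectors and carry no meaning.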
