By incorporating additional contextual information, deep biasing methods have emerged as a promising solution for speech recognition of personalized words. However, in real-world voice assistants, always biasing toward such personalized words with high prediction scores can significantly degrade the recognition of common words. To address this issue, we propose an adaptive contextual biasing method based on the Context-Aware Transformer Transducer (CATT) that uses the biased encoder and predictor embeddings to perform streaming prediction of contextual phrase occurrences. This prediction is then used to dynamically switch the bias list on and off, enabling the model to adapt to both personalized and common scenarios. Experiments on LibriSpeech and an internal voice assistant dataset show that our approach achieves up to 6.7% relative reduction in word error rate (WER) and up to 20.7% relative reduction in character error rate (CER) compared to the baseline, mitigating up to 96.7% and 84.9% of the relative WER and CER increase, respectively, for common cases. Furthermore, our approach has minimal performance impact in personalized scenarios while maintaining a streaming inference pipeline with a negligible increase in real-time factor (RTF).
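The switching mechanism described above can be illustrated with a minimal sketch. This is not the paper's implementation: the linear-plus-sigmoid classifier, the 0.5 threshold, and the names `occurrence_probability`, `select_bias_list`, and `stream_gate` are all assumptions made for illustration. The only idea taken from the abstract is that a per-step occurrence prediction, computed from the biased encoder and predictor embeddings, decides whether the bias list is active.

```python
import math
from typing import List, Sequence, Tuple


def occurrence_probability(
    enc_emb: Sequence[float],
    pred_emb: Sequence[float],
    weights: Sequence[float],
    bias: float,
) -> float:
    """Hypothetical occurrence predictor: a linear layer plus sigmoid over
    the concatenated biased encoder and predictor embeddings."""
    features = list(enc_emb) + list(pred_emb)
    if len(features) != len(weights):
        raise ValueError("weights must match the concatenated embedding size")
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def select_bias_list(
    prob: float, bias_list: Sequence[str], threshold: float = 0.5
) -> List[str]:
    """Keep the bias list when a contextual phrase is predicted to occur,
    otherwise switch it off by returning an empty list.
    The threshold value is an assumption, not taken from the paper."""
    return list(bias_list) if prob >= threshold else []


def stream_gate(
    frames: Sequence[Tuple[Sequence[float], Sequence[float]]],
    weights: Sequence[float],
    bias: float,
    bias_list: Sequence[str],
    threshold: float = 0.5,
) -> List[List[str]]:
    """Process (encoder embedding, predictor embedding) pairs one step at a
    time, as a streaming decoder would, and return the bias list that is
    active at each step."""
    active = []
    for enc_emb, pred_emb in frames:
        prob = occurrence_probability(enc_emb, pred_emb, weights, bias)
        active.append(select_bias_list(prob, bias_list, threshold))
    return active
```

In a real system the gate would presumably be trained jointly with or on top of the transducer, and the empty list would correspond to decoding without contextual bias; both points are assumptions of this sketch.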