By incorporating additional contextual information, deep biasing methods have emerged as a promising solution for recognizing personalized words in speech. However, for real-world voice assistants, always biasing toward such personalized words with high prediction scores can significantly degrade the recognition of common words. To address this issue, we propose an adaptive contextual biasing method based on the Context-Aware Transformer Transducer (CATT) that uses the biased encoder and predictor embeddings to perform streaming prediction of contextual phrase occurrences. This prediction is then used to dynamically switch the bias list on and off, enabling the model to adapt to both personalized and common scenarios. Experiments on Librispeech and internal voice assistant datasets show that our approach achieves up to 6.7% relative WER reduction and 20.7% relative CER reduction compared to the baseline, mitigating up to 96.7% and 84.9% of the relative WER and CER increase, respectively, for common cases. Furthermore, our approach has minimal performance impact in personalized scenarios while maintaining a streaming inference pipeline with a negligible increase in real-time factor (RTF).
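The on/off switching described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the per-frame occurrence probabilities, and the 0.5 threshold are all hypothetical, and the real model derives its prediction from the biased encoder and predictor embeddings rather than taking probabilities as input.

```python
from typing import Iterable, List, Sequence


def gate_bias_list(occurrence_probs: Iterable[float],
                   threshold: float = 0.5) -> List[bool]:
    """Return one decision per frame: True means the bias list is switched on.

    `occurrence_probs` stands in for the streaming prediction of contextual
    phrase occurrence (hypothetical input); `threshold` is an assumed value.
    """
    return [p >= threshold for p in occurrence_probs]


def apply_gate(biased: Sequence[str],
               unbiased: Sequence[str],
               gates: Sequence[bool]) -> List[str]:
    """Pick the biased representation where the gate is on, else the unbiased one."""
    return [b if g else u for b, u, g in zip(biased, unbiased, gates)]


# Hypothetical per-frame probabilities that a contextual phrase is occurring.
gates = gate_bias_list([0.1, 0.8, 0.6, 0.2])
mixed = apply_gate(["b0", "b1", "b2", "b3"], ["u0", "u1", "u2", "u3"], gates)
```

Because each decision depends only on the current frame's prediction, a gate of this form fits a streaming pipeline: frames with low predicted occurrence fall back to the unbiased path, which is the behavior the abstract credits for limiting degradation on common words.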