End-to-end (E2E) automatic speech recognition (ASR) systems used in voice assistants often have difficulty recognizing infrequent words that are personalized to the user, such as names and places. Rare words often have non-trivial pronunciations, and in such cases human knowledge in the form of a pronunciation lexicon can be useful. We propose a PROnunCiation-aware conTextual adaptER (PROCTER) that dynamically injects lexicon knowledge into an RNN-T model by adding a phonemic embedding alongside a textual embedding. Experimental results show that PROCTER outperforms the baseline RNN-T model, improving word error rate (WER) by 44% and 57% on personalized entities and personalized rare entities, respectively, while increasing the model size (number of trainable parameters) by only 1%. Furthermore, when evaluated in a zero-shot setting on recognizing personalized device names, PROCTER yields a 7% WER improvement, compared to only a 1% WER improvement with text-only contextual attention.
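To make the idea of combining a phonemic embedding with a textual embedding concrete, the following is a minimal NumPy sketch of an attention-based biasing step. It is an illustration under stated assumptions, not the PROCTER implementation: the function name `biasing_adapter`, the choice to sum the two embeddings, the all-zero "no-bias" entry, and the residual injection into the encoder frames are all hypothetical, since the abstract does not specify them.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def biasing_adapter(query, text_emb, phone_emb, w_q, w_k, w_v):
    """Hypothetical pronunciation-aware biasing step.

    query:     (T, d) acoustic frames from the ASR encoder
    text_emb:  (N, d) one textual embedding per catalog entity
    phone_emb: (N, d) one phonemic embedding per catalog entity
    w_q, w_k, w_v: (d, d) projection matrices
    """
    # Assumption: the entity representation is the sum of both embeddings.
    context = text_emb + phone_emb                          # (N, d)
    # Assumption: an all-zero entry lets a frame attend to "no entity".
    no_bias = np.zeros((1, query.shape[1]))
    context = np.vstack([no_bias, context])                 # (N + 1, d)

    q = query @ w_q
    k = context @ w_k
    v = context @ w_v
    weights = softmax(q @ k.T / np.sqrt(k.shape[1]))        # (T, N + 1)
    bias = weights @ v                                      # (T, d)
    # Assumption: the bias is added back to the frames as a residual.
    return query + bias, weights

# Toy usage with random values standing in for learned embeddings.
rng = np.random.default_rng(0)
T, N, d = 5, 3, 8
frames = rng.normal(size=(T, d))
text = rng.normal(size=(N, d))
phones = rng.normal(size=(N, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
biased, attn = biasing_adapter(frames, text, phones, w_q, w_k, w_v)
```

In this sketch, dropping `phone_emb` gives a text-only variant of the same attention step, which is the kind of comparison the zero-shot result above refers to.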