End-to-end speech recognition models are improved by incorporating external text sources, typically through fusion with an external language model. Such language models must be retrained whenever the corpus of interest changes. Furthermore, because they store the entire corpus in their parameters, rare words can be difficult to recall. In this work, we propose augmenting a transducer-based ASR model with a retrieval language model, which directly retrieves plausible completions for a partial ASR hypothesis from an external text corpus. These completions are then integrated into subsequent predictions by an adapter, which is trained only once, so that the corpus of interest can be switched without incurring the computational overhead of retraining. Our experiments show that the proposed model significantly improves the performance of a transducer baseline on a pair of question-answering datasets. Further, it outperforms shallow fusion on recognition of named entities by about 7% relative; when the two are combined, the relative improvement increases to 13%.
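To make the retrieval step concrete, the following is a minimal toy sketch of looking up completions for a partial hypothesis in a text corpus. It is an illustration only and not the method used in this work: the exact-match n-gram index, the function names `build_index` and `retrieve_completions`, the `context_size` and `max_results` parameters, and the example sentences are all assumptions made for this sketch, and the adapter that integrates the completions into the transducer is not modeled.

```python
from collections import defaultdict


def build_index(corpus, context_size=2):
    """Map every context_size-word window to the word sequences that follow it."""
    index = defaultdict(list)
    for sentence in corpus:
        words = sentence.lower().split()
        for i in range(len(words) - context_size):
            key = tuple(words[i:i + context_size])
            index[key].append(words[i + context_size:])
    return index


def retrieve_completions(index, partial_hypothesis, context_size=2, max_results=3):
    """Return corpus continuations that follow the last words of the hypothesis."""
    words = partial_hypothesis.lower().split()
    key = tuple(words[-context_size:])
    # .get avoids inserting empty entries into the defaultdict on a miss.
    return [" ".join(c) for c in index.get(key, [])[:max_results]]


# Hypothetical corpus of interest; switching corpora only requires a new index.
corpus = [
    "who wrote the brothers karamazov",
    "who wrote the master and margarita",
]
index = build_index(corpus)
print(retrieve_completions(index, "tell me who wrote"))
# → ['the brothers karamazov', 'the master and margarita']
```

In this toy setting, changing the corpus of interest means calling `build_index` on the new text; nothing is retrained, which mirrors the motivation stated above for keeping the corpus outside the model parameters.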