The neural transducer is now the most popular end-to-end model for speech recognition, owing to its natural streaming ability. However, it is challenging to adapt with text-only data. The factorized neural transducer (FNT) model was proposed to mitigate this problem. FNT's improved ability to adapt on text-only data comes at the cost of lower accuracy than the standard neural transducer model. We propose several methods to improve the performance of the FNT model: adding a CTC criterion during training, adding a KL divergence loss during adaptation, using a pre-trained language model to seed the vocabulary predictor, and an efficient adaptation approach that interpolates the vocabulary predictor with an n-gram language model. A combination of these approaches yields a relative word-error-rate reduction of 9.48\% over the standard FNT model. Furthermore, n-gram interpolation with the vocabulary predictor greatly improves adaptation speed while maintaining satisfactory adaptation performance.
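To illustrate the last of these methods, the following is a minimal sketch of interpolating a vocabulary predictor's next-token distribution with an n-gram language model. The abstract does not state the form of the interpolation, so linear interpolation in the probability domain is an assumption here, and the function names, the `ngram_weight` parameter, and the toy numbers are hypothetical.

```python
import math


def log_softmax(logits):
    """Convert raw scores to log-probabilities in a numerically stable way."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def interpolate_log_probs(vocab_log_probs, ngram_log_probs, ngram_weight):
    """Mix two next-token distributions over the same vocabulary.

    Assumed form (not taken from the paper): linear interpolation,
        p(y) = (1 - w) * p_vocab(y) + w * p_ngram(y),
    with the result returned as log-probabilities.
    """
    if not 0.0 <= ngram_weight <= 1.0:
        raise ValueError("ngram_weight must lie in [0, 1]")
    if len(vocab_log_probs) != len(ngram_log_probs):
        raise ValueError("both distributions must cover the same vocabulary")

    mixed = []
    for lp_vocab, lp_ngram in zip(vocab_log_probs, ngram_log_probs):
        p = (1.0 - ngram_weight) * math.exp(lp_vocab) + ngram_weight * math.exp(lp_ngram)
        mixed.append(math.log(p) if p > 0.0 else float("-inf"))
    return mixed


# Toy example: a 3-token vocabulary.
vocab_lp = log_softmax([2.0, 1.0, 0.1])                 # hypothetical predictor output
ngram_lp = [math.log(p) for p in (0.1, 0.7, 0.2)]       # hypothetical in-domain n-gram LM
adapted_lp = interpolate_log_probs(vocab_lp, ngram_lp, ngram_weight=0.3)
```

In a sketch like this, adaptation only requires estimating an n-gram model on the target-domain text and choosing a weight, with no gradient updates to the network, which is consistent with the speed advantage the abstract reports.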